Texture Aware Deep Feature Map Based Linear Weighted Medical Image Fusion

Medical image analysis is a critical task for clinicians and radiologists, who rely on fine detail for proper diagnosis. Different medical imaging modalities capture complementary details of the region of interest (ROI), which motivates researchers to combine the pathological details for ease of clinical diagnosis. In this paper, the objective is to obtain a comprehensive image that presents composite image details from two multimodal images of the same ROI. The basic idea is to generate robust fusion weights in the form of individual weight matrices that govern how the input image matrices contribute to the fused outcome. Texture features are extracted with the fast gray level co-occurrence matrix (GLCM)-mean technique. The feature maps of the source images are derived from convolution layers, and texture analysis is performed on them to evaluate a weight map. Linear weights-based spatial domain fusion is then carried out using the weight map. After trying several relevant fusion strategies and baseline hyper-parameter tuning, the obtained outputs are validated via objective analysis in terms of standard metrics and compared with other fusion methods.


I. INTRODUCTION
Multimodal medical image fusion is an auxiliary approach that helps doctors diagnose more easily by combining information from multiple imaging modalities. The objective of image fusion is to integrate details from different parent images of the ROI to derive a comprehensive image that provides composite visual details from the multimodal images [1], [2]. Compared with the parent images, the visual information contained in the fused image is much more detailed. Fusion increases the amount of visual information available in a single image while reducing the redundancy of information present across two or more images. Image fusion is predominantly employed in medical image diagnosis, remote sensing, agriculture, surveillance, and navigation.
The pathological analysis of the ROI for disease diagnosis is possible by inspecting multimodal medical images such as Computed Tomography (CT), Magnetic Resonance Imaging (MRI), X-ray, Positron Emission Tomography (PET), and Single Photon Emission Computed Tomography (SPECT) [3]. The fusion of CT and MRI presents anatomical and functional information in the composite image, which makes the diagnosis less laborious for clinicians. Image fusion methods are classified into pixel level fusion, feature level fusion, and decision level fusion [4]. Pixel level fusion processes the raw pixel values of the parent images and retains most of the original information. Feature level fusion operates on the points, angles, edges, textures, and other features extracted from the source images. Decision level fusion is carried out on the information extracted via low and mid-level image processing. In decision level fusion, both redundant and uncertain information can be reduced while retaining the useful information present in the source images to serve image analysis better. This paper focuses on obtaining a single image that presents better information by fusing two multimodal medical images. MRI reveals the functional abnormalities of organs/tissues, whereas CT exposes them on an anatomical level. Thus, the proposed image fusion technique provides both kinds of detail in a single image, and it can be carried out in several variants.
The associate editor coordinating the review of this manuscript and approving it for publication was Rajeswari Sundararajan.
Exploring the capability of a Deep Learning (DL) network helps to extract informative features and data representations. DL has been leading the state-of-the-art results in several computer vision and image processing operations [5]. Standard fusion practices follow a step-wise max fusion strategy to combine the individual fused feature maps at the end. In contrast, we form individual textural matrices with the aid of the fast Gray Level Co-occurrence Matrix (GLCM)-mean technique and then derive individual weight matrices containing the fusion weights [6]. The feature maps of MRI and CT are derived using two levels of convolution layers. The feature maps are averaged for each modality, and then the fast-GLCM is applied to extract the texture feature maps of MRI and CT. Upon applying specific criteria, a weight map is obtained from the two fast-GLCM feature maps. This weight map is used to carry out spatial domain fusion for the source images. The performance of the proposed fusion method is compared with other fusion methods using the standard fusion metrics. This paper is organized into sections covering the different aspects of the proposal, starting with the literature review. In the literature review, a study of multimodal image fusion strategies using deep learning is conducted. Three types of decision functions are presented in Section 3. Upon dataset acquisition, the intuitive interpretation is backed by the implementation and the results are attested in Section 4. This is followed by a conclusion in Section 5.

A. LITERATURE SURVEY
Fayez and Sabine et al. proposed a novel image fusion model which is based on the Visual Geometry Group (VGG)-19 and softmax operator [7]. The proposed fusion model uses the weighted fusion technique. VGG-19 is used to extract feature maps from CT and MRI images, which are then processed by the softmax operator to generate weights needed for weighted fusion.
The starting point for our fusion algorithm is the Zero Learning Medical Image Fusion (ZLMIF) technique [7]. As discussed above, the fundamental idea is to feed the individual deep feature maps, as numeric vectors, to the softmax operator, which converts them into a vector of probabilities. The normalized values are used as fusion weights for the respective feature maps, which yields weight maps that are then combined to obtain the final fused image.
Zhang et al. published a method that revolves around the proposition of a general fusion framework for varied forms of datasets which include infrared and visible images, multifocus images, MRI/CT images of the brain and multi-exposure images [4]. They used different fusion strategies for each type of input dataset. For performing a comprehensive fusion of infrared, multifocus, and medical images, max fusion is employed whereas for the multi-exposure images, mean fusion is used.
Inspired by this framework, which is solely based on transform-domain image fusion algorithms, we moved ahead with a convolutional neural network consisting of a feature extraction module, a feature fusion module, and an image regeneration module [4].
Fu et al. proposed a fusion model that uses a rolling guidance filter and the VGG-16 convolutional network [8]. The rolling guidance filter produces a base image and a detail image. The convolutional neural network (CNN) produces a perceptual image. MRI and CT images are given as input to the rolling guidance filter and the CNN to produce three pairs of images altogether. Base images are fused by the local energy maximum fusion rule, detail images by the local variance maximum fusion rule, and perceptual images by the sum modified Laplacian maximum fusion rule. Finally, the three fused images are combined to get the final fused output.
Nishant et al. presented an unsupervised CNN model for the fusion of high and low-frequency components of MRI-PET source image pairs by exploiting structural similarity index (SSIM) as the loss function during training [9]. The authors suggested an application of color coding to visualize the outcome upon respective quantification of each input image in terms of the partial derivatives of the fused image.
Zhang et al. proposed a medical image fusion model that is specifically based on DenseNet, which aspires for feature reuse by interconnecting the features over channels. This enables the algorithm to perform better than conventional models with fewer parameters and calculation costs [10]. Nasrin and Ahmad proposed a method using VGG19, a pretrained network, for the fusion of MRI and PET scans. The weights for the fusion were extracted from the features of pretrained CNN layers [11].

B. DECISION FUNCTIONS
Based on the analysis of existing fusion methods, we decided to focus specifically on the fusion module of the architecture and went on to examine three different fusion strategies. Their respective intuitions can be briefly illustrated as follows:

1) GLCM ENERGY-BASED DECISION FUNCTION
The features are extracted from the source images using two convolutional layers. The depth of the first convolution layer is 64. Hence, 64 feature maps are generated for each of the MRI and CT images. For each feature map, the energy is evaluated. $E^i_1$ and $E^i_2$ are the energies of the feature maps, where $i$ runs from 1 to 64. The feature fusion is governed by the following criteria.
Append CT feature map
else
Append MRI feature map
end

2) GLCM ENERGY AND CONTRAST-BASED DECISION FUNCTION
For each feature map, energy and contrast values are evaluated. $C^i_1$ and $C^i_2$ are the contrast of the feature maps of the CT and MRI source images respectively, where $i$ indexes the feature maps and runs from 1 to 64. Similarly, $E^i_1$ and $E^i_2$ are the energies of the feature maps. The fusion strategy is presented as follows.
• Let a, b, c, and d be variables initialized to zero. The counts of the variables are increased according to the stated criteria.
Append MRI feature map
end
end

3) SSIM-BASED DECISION FUNCTION
SSIM is evaluated between the fused and ground truth images and returns a value in the range of 0 to 1. The score can be computed each time by taking a specific ground truth image as the reference image and one of the medical modalities, in the form of a feature map of the same scene, as the processed image [12]. Thus, the basic idea is to exploit this concept to accumulate robust local feature maps with ground truth images. The fusion decision function based on the SSIM score is stated below.
Append MRI feature map
end
In the above decision functions for fusion, the feature map selection is carried out using GLCM and SSIM. From the selected feature maps of the source medical images, the fused image is reconstructed using a reconstruction module. Lastly, we explored the regeneration phase of the setup and, as a result, moved to the FunFuseAn framework. In this framework, after feature extraction, the fusion of high and low-frequency components of MRI-CT grayscale image pairs can be done separately by exploiting SSIM as the loss function during training [8]. The frequency components are handled separately to avoid loss and mismatch of the information contained in the fused outcome.

II. PROPOSED METHODOLOGY
In DL-based fusion networks, the features are extracted by the convolution layers and fused using specific fusion criteria, as shown in Fig. 1. Then, the reconstruction module delivers the fused image from the fused features. In this paper, we propose a Texture aware Deep Feature map-based linear weighted Image Fusion model (TDFIF). The model works in two phases, namely the training phase and the fusion phase. In the training phase, the medical imaging modalities are fed into the proposed network as inputs and the network is trained on them. In the fusion phase, a single pair of MRI and CT images is given as input to the trained model to get the fused output.
The basic idea is to generate robust fusion weights in the form of a weight matrix that governs how the input image matrices contribute to the fusion outcome. To be precise, three specific decision rules are defined based on the possible inequalities between the corresponding pixels of two texture matrices. These rules generate the fusion weights in the form of individual weight matrices. Finally, the weight matrices are applied to the respective input image matrices and the results are added linearly to obtain the final fused image, which, unlike either source image, carries composite image details.

A. FEATURE EXTRACTION MODULE
In this module, there are two convolution layers. The first convolution layer has a kernel of size 3 × 3 with one input channel and 64 output channels, whereas the second layer has a kernel of size 3 × 3 with 64 input channels and 64 output channels. Moreover, the padding and stride are both fixed at unity so that the size of the feature map is preserved. Upon feature extraction of the individual modalities, the respective feature maps are summed up independently to produce two summed-up maps. The detailed block diagram of the feature extraction module is shown in Fig. 2.
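A minimal sketch of the size-preserving convolution described above, written in plain Python for illustration. It covers only the single-input-channel case of the first layer; the function names and the example kernel are assumptions introduced here, and the trained kernels of the actual network are not reproduced.

```python
def conv2d_same(image, kernel):
    """3x3 convolution (cross-correlation form) with zero padding 1 and stride 1."""
    rows, cols = len(image), len(image[0])
    # Zero-pad by one pixel on every side so the output keeps the input size.
    padded = [[0.0] * (cols + 2) for _ in range(rows + 2)]
    for r in range(rows):
        for c in range(cols):
            padded[r + 1][c + 1] = float(image[r][c])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dr in range(3):
                for dc in range(3):
                    acc += kernel[dr][dc] * padded[r + dr][c + dc]
            out[r][c] = acc
    return out


def extract_feature_maps(image, kernels):
    """Apply one 3x3 kernel per output channel (64 channels in the text)."""
    return [conv2d_same(image, k) for k in kernels]


# Hypothetical example: an identity kernel stands in for 64 learned kernels.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
maps = extract_feature_maps(image, [identity] * 64)
print(len(maps), len(maps[0]), len(maps[0][0]))
```

With padding and stride of one, each of the 64 maps has the same height and width as the input, which is the property the module relies on.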

B. LOSS FUNCTION
The subjective analysis of the fusion outcome depends on the local luminance, contrast, and structural properties of the image. For this reason, SSIM [13], which is based on human perception, is considered as a loss function. The fusion method with loss function is presented in Fig. 3.
SSIM requires two images, a reference image and a processed image, and returns a value in the range of 0 to 1. The score can be computed each time by taking a specific ground truth image as the reference image and one of the medical modalities (in the form of a feature map) of the same ROI as the processed image. The notation used in the SSIM computation is as follows.
Here, $I_1$ and $I_2$ are the two inputs (medical imaging modalities), $N$ is the number of local windows in $I_1$ and $I_2$, and $i_k$ and $j_k$ are the $k$th local image contents of $I_1$ and $I_2$, respectively. We have set $\alpha$, $\beta$ and $\gamma$ to unity, so that all three properties, namely structure, contrast, and luminance, are given the same weight.
Equations 2a, 2b, and 2c describe the luminance, contrast and structural properties of the local image contents $i_k$ and $j_k$. $\mu_{i_k}$ and $\mu_{j_k}$ are the means; $\sigma_{i_k}$ and $\sigma_{j_k}$ are the standard deviations of the image pixel values; $\sigma_g$ is the standard deviation of the Gaussian filter; and $\sigma_{i_k j_k}$ is the correlation term between $i_k$ and $j_k$.
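For reference, the widely used form of SSIM that matches the notation above can be written as follows. This is the standard definition rather than a transcription of the paper's own equations; the stabilising constants $C_1$, $C_2$, $C_3$ are assumptions introduced here, and $\sigma_{i_k j_k}$ is taken to be the cross-covariance of the two windows.

```latex
\begin{align*}
\mathrm{SSIM}(I_1, I_2) &= \frac{1}{N}\sum_{k=1}^{N}
  l(i_k, j_k)^{\alpha}\, c(i_k, j_k)^{\beta}\, s(i_k, j_k)^{\gamma} \\
l(i_k, j_k) &= \frac{2\mu_{i_k}\mu_{j_k} + C_1}{\mu_{i_k}^2 + \mu_{j_k}^2 + C_1} \tag{2a} \\
c(i_k, j_k) &= \frac{2\sigma_{i_k}\sigma_{j_k} + C_2}{\sigma_{i_k}^2 + \sigma_{j_k}^2 + C_2} \tag{2b} \\
s(i_k, j_k) &= \frac{\sigma_{i_k j_k} + C_3}{\sigma_{i_k}\sigma_{j_k} + C_3} \tag{2c}
\end{align*}
```

With $\alpha = \beta = \gamma = 1$, the three terms contribute equally to each local score, and the overall value is the mean over the $N$ local windows.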
The pixel loss, $L_2$, which tends to preserve luminance better, is used in addition to SSIM. The steerable total loss function combines the two, where $I_1$ and $I_2$ are the two source images and $F$ is the final fused image.

C. FEATURE FUSION MODULE
After feature extraction from the individual modalities, the respective feature maps are summed up independently to produce two summed-up maps. The feature map sum is then divided by the number of feature maps to get a feature map average $F_{avg}$. Here, $F_{sum}$ is the feature map sum and $F_i$ is the $i$th feature map. The grey levels are normalized to bring the values in the feature map sum to a common scale without distorting differences in the range of values, and hence we obtain the average of all the feature maps.
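The sum and average described above can be written compactly as follows, where $n$ denotes the number of feature maps per modality (64 for the feature extraction module described earlier). The symbol $n$ is introduced here for illustration.

```latex
\begin{align*}
F_{sum} &= \sum_{i=1}^{n} F_i \\
F_{avg} &= \frac{F_{sum}}{n}
\end{align*}
```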
Then, texture feature extraction is carried out with the fast GLCM-Mean technique, applied to the pre-computed $F_{avg}$. In this way, two independent textural matrices, one per modality, are obtained.
We have employed a GLCM texture feature-based decision rule for the fusion of the chosen modalities [6]. Specifically, we used the fast GLCM-Mean technique to extract the second-order statistical texture features from the brain image. The texture in an image describes how one grey level co-occurs with another. The GLCM is a matrix containing the frequencies of co-occurrence of each pair of neighbouring levels. The entries in the GLCM signify how often a specific pair of pixel values occurs under a specific spatial relationship. Preserving texture in the fused image obtained from the source modalities is essential in medical image fusion, as texture details help in classifying whether the image contains abnormalities or not. The equation for calculating the GLCM mean for the $k$th local image content is
$$\mu_k = \sum_{i}\sum_{j} i \, P_k(i, j)$$
where $\mu_k$ is the GLCM mean of the $k$th local image content, $P_k$ is the GLCM matrix of the $k$th local image, and $i$ is the reference pixel value. Here, $P_k(i, j)$ represents the probability of pixel values $i$ and $j$ occurring side by side in the $k$th local image. A new texture feature map is formed by assigning the numerical value $\mu_k$ at the center of the local image window, giving $T_A$ for MRI and $T_B$ for CT. The fused image is obtained through the weights that are derived from the three fusion rules based on the possible inequalities between the corresponding pixels of the two textural matrices. This generates robust fusion weights in the form of individual weight matrices.
Here, $W_A$ and $W_B$ are the weight maps of MRI and CT respectively, and $T_A$ and $T_B$ are the textural matrices of MRI and CT respectively. Finally, the source image matrices $I_1$ and $I_2$ are multiplied by their respective weight maps and added linearly to obtain the final fused image matrix, as given in equation 8 and shown in Fig. 4 & Fig. 5. This output image is expected to carry comprehensive detail for pathological analysis.
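The GLCM mean and the texture feature map described in this subsection can be sketched as below. This is a plain reference version, not the fast GLCM implementation; the horizontal neighbour offset, the 3 × 3 window, the untreated border pixels, and the function names are all assumptions made for illustration. The input is assumed to be already quantised to integer grey levels in the range 0 to levels − 1.

```python
def glcm_mean(window, levels):
    """GLCM mean of one local window for the horizontal neighbour offset."""
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for row in window:
        # Each (left, right) neighbour pair adds one co-occurrence count.
        for left, right in zip(row, row[1:]):
            counts[left][right] += 1
            total += 1
    # mu = sum_i sum_j i * P(i, j), with P the normalised co-occurrence counts.
    weighted = sum(i * counts[i][j] for i in range(levels) for j in range(levels))
    return weighted / total


def texture_map(image, levels, radius=1):
    """Write each window's GLCM mean at the window centre; borders stay 0.0."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(radius, rows - radius):
        for c in range(radius, cols - radius):
            window = [image[rr][c - radius:c + radius + 1]
                      for rr in range(r - radius, r + radius + 1)]
            out[r][c] = glcm_mean(window, levels)
    return out


# Hypothetical 3x3 patch with four grey levels.
patch = [[0, 0, 1], [1, 2, 2], [3, 3, 3]]
print(glcm_mean(patch, levels=4))
print(texture_map(patch, levels=4))
```

Applying `texture_map` to the averaged feature map of each modality would give the two textural matrices on which the weight rules operate.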

III. RESULTS AND ANALYSIS
The performance of the proposed method is validated on a set of images and compared with other fusion methods. Dense scale invariant feature transform (DSIFT), sparse representation (SR) fusion, ZLMIF, image fusion framework based on CNN (IFCNN), FunFuseAn, and VGG19 [7] are the methods used for comparison. The first of the three experiments analyzes the fusion metrics for the four image pairs in the dataset. The source image pairs and fusion outputs are presented in Fig. 6, 7, 8 & 9.

Algorithm 1 Fusion of Extracted Feature Map Average $F_A$ and $F_B$
Input: Extracted mean feature maps $F_A$ and $F_B$.
Steps
1) Extract GLCM mean feature map from $F_A$ and $F_B$. $P_A$ and $P_B$ are GLCM matrices of $F_A$ and $F_B$ respectively.
2) Using the SSIM-based decision functions, generate weight maps $W_A$ and $W_B$.
3) Generate fused image by multiplying $W_A$ and $W_B$ with A and B and then adding the products, where A and B are MRI and CT images respectively.
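Steps 2 and 3 of Algorithm 1 can be illustrated with the sketch below. The weight values in `weight_maps` are a hypothetical three-case rule on the corresponding texture values (greater, smaller, equal) chosen purely for illustration; the exact rules used by the authors are not reproduced here. Only the final step, multiplying each source image by its weight map and adding the products, follows directly from the algorithm.

```python
def weight_maps(t_a, t_b):
    """Hypothetical three-case rule on texture values (an assumption)."""
    w_a, w_b = [], []
    for row_a, row_b in zip(t_a, t_b):
        wa_row, wb_row = [], []
        for a, b in zip(row_a, row_b):
            if a > b:
                wa, wb = 1.0, 0.0   # texture of A dominates
            elif a < b:
                wa, wb = 0.0, 1.0   # texture of B dominates
            else:
                wa, wb = 0.5, 0.5   # equal texture: share equally
            wa_row.append(wa)
            wb_row.append(wb)
        w_a.append(wa_row)
        w_b.append(wb_row)
    return w_a, w_b


def fuse(img_a, img_b, w_a, w_b):
    """Linear weighted fusion: F = W_A * A + W_B * B, element by element."""
    return [[wa * a + wb * b
             for a, b, wa, wb in zip(ra, rb, rwa, rwb)]
            for ra, rb, rwa, rwb in zip(img_a, img_b, w_a, w_b)]


# Hypothetical 2x2 inputs: A stands for MRI, B for CT.
A = [[10, 20], [30, 40]]
B = [[50, 60], [70, 80]]
T_A = [[1, 0], [2, 2]]
T_B = [[0, 1], [2, 3]]
W_A, W_B = weight_maps(T_A, T_B)
fused = fuse(A, B, W_A, W_B)
print(fused)
```

Because the two weight maps sum to one at every pixel in this sketch, the fused intensities stay within the range spanned by the two source images.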
The metrics used for performance analysis are the quality metric ($Q_{mi}$) [6], [14], and feature mutual information (FMI-pixel) [6], [14]. The second experiment is about edge preservation analysis using the detect correct similarity (DCS) [14] metric, the quality metric based on local similarity ($Q_Y$), the contrast based quality metric ($Q_{cb}$) [15], and SSIM [9]. The proposed method is tested on all the image pairs and the mean values of the standard metrics are evaluated in the third experiment.
the MRI sequences that are solely considered for the training purpose.
The images are registered brain images of different modalities. In some cases, the multimodal brain images are offered with fused ground truth images. The complete fusion dataset for multimodal images is not available, and hence the registered multimodal brain images are selectively taken with appropriate preregistration. The 268 source image pairs are obtained similarly. The fusion dataset derived from the website is used for training and testing. The ground truth images are derived from the well-known fusion strategies and subsequently used for training and testing.

B. PERFORMANCE METRICS
The performance evaluation metrics belong to the category of non-reference-based metrics. $Q_{mi}$ and FMI-pixel deliver the metric score based on mutual statistical information. The DCS metric extracts the edge similarity between the parent modalities and the fused outcome. SSIM and $Q_{cb}$ are quality assessment metrics based on the Human Visual System (HVS), which employ structural similarity to compute the metric score. We have chosen three deep learning-based models, namely, ZLMIF, IFCNN, and FunFuseAn, for a relative assessment against our proposed framework.

2) DETECT CORRECT SIMILARITY
It is the ratio of the edge pixels present in both $I_1$ and $I_2$ to the edge pixels present in $I_1$ but not in $I_2$. DCS reveals the similarity between two images based on edge pixels. The higher the value of DCS, the better the similarity between the two images [16].

$$\mathrm{DCS} = \frac{\text{Edge pixels present in both } I_1 \text{ and } I_2}{\text{Edge pixels present in } I_1 \text{ but not in } I_2} \tag{10}$$
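A minimal sketch of the DCS ratio, assuming the two edge maps are already available as binary grids of equal size (the edge operators themselves are not shown). Returning infinity when no edge pixel is exclusive to $I_1$ is a choice made here for illustration, not something stated in the text.

```python
def dcs(edges_1, edges_2):
    """Ratio of shared edge pixels to edge pixels found only in the first map."""
    both = 0
    only_1 = 0
    for row_1, row_2 in zip(edges_1, edges_2):
        for e1, e2 in zip(row_1, row_2):
            if e1 and e2:
                both += 1
            elif e1 and not e2:
                only_1 += 1
    if only_1 == 0:
        # Assumption: every edge of the first map is matched, so the ratio is unbounded.
        return float("inf")
    return both / only_1


# Hypothetical binary edge maps (1 marks an edge pixel).
edge_a = [[1, 1, 0], [0, 1, 1]]
edge_b = [[1, 0, 0], [0, 1, 1]]
print(dcs(edge_a, edge_b))
```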

3) STRUCTURAL SIMILARITY INDEX
SSIM is a quality evaluation metric that considers contrast, variance and luminance to measure the structural similarity between images [17]. SSIM takes values ranging from 0 to 1. Values of SSIM close to 0 reveal low similarity, whereas values close to 1 reveal high similarity between the images.
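As an illustration, the SSIM of a single pair of windows can be computed as below. The sketch uses the common simplified form with equal exponents and the conventional constants $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$ for dynamic range $L$; these constants, the flattened-window input, and the function name are assumptions, and no Gaussian weighting is applied.

```python
def ssim_window(x, y, dynamic_range=255.0):
    """SSIM of two equally sized windows given as flat lists of pixel values."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Unbiased sample variances and covariance of the two windows.
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Hypothetical windows: identical content, then reversed content.
w1 = [10, 20, 30, 40]
print(ssim_window(w1, w1))
print(ssim_window(w1, [40, 30, 20, 10]))
```

An image-level score would then be the mean of this quantity over all local windows.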

4) QUALITY METRIC BASED ON LOCAL SIMILARITY
$Q_Y$ is a metric that employs local structural similarity between the source images as a measure. The local structural similarities of a window $w$ are calculated, namely SSIM(x, y | w), SSIM(x, f | w) and SSIM(y, f | w), where x and y are the source images and f is the fused image. The local weight is computed from $s(I_1^w)$ and $s(I_2^w)$, the local variances of window $w$.
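For orientation, the commonly cited form of this local-similarity metric is given below. The symbol $\lambda(w)$ for the local weight and the threshold 0.75 come from that commonly cited form and are assumptions with respect to this paper, whose own equation is not reproduced here.

```latex
\[
Q_Y(x, y, f \mid w) =
\begin{cases}
\lambda(w)\,\mathrm{SSIM}(x, f \mid w) + \bigl(1 - \lambda(w)\bigr)\,\mathrm{SSIM}(y, f \mid w),
  & \mathrm{SSIM}(x, y \mid w) \ge 0.75 \\[4pt]
\max\bigl\{\mathrm{SSIM}(x, f \mid w),\ \mathrm{SSIM}(y, f \mid w)\bigr\},
  & \mathrm{SSIM}(x, y \mid w) < 0.75
\end{cases}
\qquad
\lambda(w) = \frac{s(x \mid w)}{s(x \mid w) + s(y \mid w)}
\]
```

The overall score is the average of $Q_Y(x, y, f \mid w)$ over all windows $w$.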

5) FEATURE MUTUAL INFORMATION
FMI-pixel is a mutual information-based metric that calculates the mutual information and entropies regionally.
Here, $H_i(CT)$, $H_i(MRI)$, and $H_i(F)$ are the locally evaluated entropies of the source images CT and MRI and the fused image $F$ respectively. $p(x, y)$ is the joint probability distribution of random variables $x$ and $y$. $p(x)$ and $p(y)$ are the probability distribution functions of random variables $x$ and $y$ respectively. $I_i$ is the mutual information, and $n$ is the number of local regions.
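The entropy and mutual information of one local region can be estimated from pixel histograms as sketched below. This shows only the two building blocks named above; how FMI-pixel normalises and averages them over the $n$ regions is not reproduced, and the function names are assumptions.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete pixel values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Mutual information (in bits) between two aligned pixel sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    total = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
        total += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return total


# Hypothetical flattened regions from a source image and the fused image.
source_region = [0, 0, 1, 1]
fused_region = [0, 0, 1, 1]
print(entropy(source_region), mutual_information(source_region, fused_region))
```

When the fused region reproduces the source region exactly, the mutual information equals the entropy of the region, and it drops to zero when the two are statistically independent.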

6) CONTRAST-BASED QUALITY METRIC
$Q_{cb}$ is a perceptual quality measure that employs the major features of the human visual system model. It uses the contrast sensitivity function to describe human sensitivity to contrast.
Here, $C_A$, $C_B$, and $C_F$ are the contrast maps of the source images A and B and the fused image F respectively.

C. FUSION PERFORMANCE ANALYSIS
The fusion methods are employed on the four sets of source image pairs and the metrics are presented below.

1) MUTUAL INFORMATION-BASED METRICS
It is observed from the metrics that TDFIF delivers good information transfer from the source images to the fused image. This can be seen by analyzing $Q_{mi}$, for which TDFIF delivers superior results compared to all other methods. This metric is evaluated over the complete source and fused images, whereas FMI-pixel is evaluated based on the mutual information of local regions. Due to the contributions of a few local regions, the average information is higher for the other methods. The performance of the proposed method for all the image pairs in the dataset is presented in Table 1 & 2.

D. QUALITATIVE PERFORMANCE ASSESSMENT
The qualitative performance analysis of TDFIF is carried out using edge preservation, contrast, variance, and the structural similarity between the fused image and source images.

1) EDGE PRESERVATION ANALYSIS
The edge preservation capability of the proposed method is analyzed by DCS using two edge operators. The edge similarity is evaluated and the mean value is observed for analysis [16]. The DCS metric for four sets of images is presented in Table 3. It could be observed from the tabulated values that the proposed method preserves edges better than deep learning-based fusion methods. The other two methods, DSIFT and SR fusion, perform better than TDFIF.

2) CONTRAST AND VARIANCE BASED ANALYSIS
$Q_Y$ is the local structural similarity measure using SSIM. The proposed method delivers good local similarity for two of the image pairs and performs moderately well for the other two image pairs, as presented in Table 4. $Q_{cb}$ is the contrast sensitivity-based metric, on which DSIFT tops the performance metrics, whereas the proposed method performs moderately well among the DL-based methods. From Table 5, it can be observed that similar performance is reflected in DCS. SSIM is another qualitative metric that analyzes the structural similarity between the source and fused images considering contrast, variance, and illumination. The evaluated values are presented in Table 6. The SSIM values of the proposed method are better than those of the other fusion methods taken for analysis, except for the VGG19-based fusion method.

E. DEPICTION OF RELATIVE ASSESSMENT OF FUSION METHODS
The analysis of metrics among the DL-based methods leads to a ranking of their performance. The TDFIF method tops the ranking, as shown in Fig. 10, as its performance is good in $Q_{mi}$, $Q_Y$ and SSIM. Compared to other DL-based methods, its performance is moderately good in other metrics. The output images of the proposed method are subjectively superior compared to other fusion methods.

F. ANALYSIS OF SEGMENTED ROI AFTER FUSION
The impact of fusion in segmentation is analyzed by segmenting the source and fused images using fuzzy C-Means (FCM) clustering algorithm. The images are segmented into five clusters, then the segmented regions are presented in Fig. 11. It could be observed that the details present in the source images are fused and presented in the segmented region of the fused image. This would help in analyzing the ROI from the single fused image.
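A compact sketch of the standard fuzzy C-means updates, applied to a flat list of pixel intensities. The evenly spaced initial centres, the fuzzifier m = 2, the fixed iteration count, and the two-cluster example are assumptions made for brevity; the experiment described above uses five clusters on full images.

```python
def fuzzy_c_means(values, n_clusters, m=2.0, iterations=50):
    """Cluster 1-D intensities; returns (centres, hard labels)."""
    lo, hi = min(values), max(values)
    # Deterministic start: centres spread evenly over the intensity range.
    centres = [lo + (hi - lo) * (k + 0.5) / n_clusters for k in range(n_clusters)]
    memberships = []
    for _ in range(iterations):
        memberships = []
        for v in values:
            dists = [abs(v - c) for c in centres]
            if 0.0 in dists:
                # A value lying exactly on a centre belongs fully to it.
                hit = dists.index(0.0)
                u = [1.0 if k == hit else 0.0 for k in range(n_clusters)]
            else:
                # u_k is proportional to d_k ** (-2 / (m - 1)), normalised to sum to 1.
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                u = [w / total for w in inv]
            memberships.append(u)
        for k in range(n_clusters):
            # Each centre is the mean of the values weighted by membership ** m.
            weights = [u[k] ** m for u in memberships]
            denom = sum(weights)
            if denom > 0:
                centres[k] = sum(w * v for w, v in zip(weights, values)) / denom
    # Hard labels come from the memberships of the final iteration.
    labels = [max(range(n_clusters), key=lambda k: u[k]) for u in memberships]
    return centres, labels


# Hypothetical intensities with two well separated groups.
pixels = [10, 12, 11, 200, 205, 198]
centres, labels = fuzzy_c_means(pixels, n_clusters=2)
print(centres, labels)
```

Running the same routine on the flattened source and fused images, and reshaping the labels back to the image grid, gives segmentations that can be compared region by region.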

G. FEASIBILITY AND FUTURE SCOPE
As for future scope, one could strengthen the two main modules, namely feature fusion and feature regeneration. The former could be strengthened by employing other fusion strategies that could potentially improve the fused outcome. Moreover, one could also look at different classifiers, apart from the concept of decision mapping, that could assist the process efficiently. To strengthen the latter module, one could examine the regeneration layers in detail to interpret what happens within the FC phase of the network. A comprehensive visualization of the feature maps would serve this purpose. One could also make this module unnecessary by opting for weighted fusion, as it eliminates the anticipated biases. Apart from strengthening the individual modules, one could try baseline hyper-parameter tuning and robust training approaches (with an appropriate loss function, increasing the number of inputs via data augmentation, etc.), as well as multi-modal inputs such as PET and SPECT in addition to the standard medical imaging modalities. Thus, the existing proposal appears feasible and scalable from the technical standpoint, as depicted in Tables 1-6.

H. FUSION PERFORMANCE ON THE DATASET
The proposed TDFIF method is tested on all the image pairs in the dataset and the mean of metrics is presented in Table 7. It is observed that the qualitative and quantitative performances are moderately good for the proposed method.

IV. CONCLUSION
Multimodal fusion plays a vital role in combining complementary image details, thus eliminating the redundancy present among the multiple medical images of the same ROI. This paper evaluates statistical parameters from the GLCM matrices of the feature maps. The feature maps are derived from the source images using two sets of convolution layers. The decision function-based weights are derived from the GLCM matrix of the feature maps. It could be observed that the proposed TDFIF with SSIM-based decision function can deliver good fusion results subjectively. The performance is evaluated by the standard fusion metrics and also compared with other fusion algorithms. The objective evaluation is also good compared to other fusion methods.

DHEERAJ KANDIKATTU is currently pursuing the Bachelor of Technology degree in electronics and communication engineering with the Vellore Institute of Technology (VIT University), Chennai Campus. His research interests include deep learning and machine learning. He is also into competitive programming.
UTKARSH is currently pursuing the Bachelor of Technology degree in electronics and communication engineering with the Vellore Institute of Technology (VIT University), Chennai Campus. His research interests include data analytics, machine learning, and blockchain. More specifically, his research interests involve incorporating a data-driven approach in interpreting stuffs and attaining a standpoint, which is technically feasible and robust from business point of view.
MUKUL KUMAR is currently pursuing the Bachelor of Technology degree in electronics and communication engineering with the Vellore Institute of Technology (VIT University), Chennai Campus. His research interests include data science and software development. In particular, he has developed a strong interest in data analytics and machine learning techniques, which can be used to extract valuable insights from large data sets.