Noninvasive COVID-19 Screening Using Deep-Learning-Based Multilevel Fusion Model With an Attention Mechanism

The current pandemic has necessitated rapid and automatic detection of coronavirus disease (COVID-19) infections. Various artificial intelligence functionalities coupled with biomedical images can be utilized to efficiently detect these infections and recommend a prompt response (curative intervention) to limit the virus's spread. In particular, biomedical imaging can help visualize the internal organs of the human body and the disorders that affect them. One such modality, the chest X-ray (CXR), is widely used for preventive medicine and disease screening. However, when it comes to detecting COVID-19 from CXR images, most approaches rely on standard image classification algorithms, which suffer from low identification accuracy and poor extraction of key features. As a result, a convolutional neural network (CNN)-based fusion network has been developed for automated COVID-19 screening in this study. First, using attention networks and multiple fine-tuned CNN models, we extract key features that are resistant to overfitting. We then employ a locally connected layer to create a weighted combination of these models for final COVID-19 detection. Using a publicly available dataset of CXR images from healthy subjects as well as COVID-19 and pneumonia cases, we evaluated the predictive capabilities of our proposed model. Test results demonstrate that the proposed fusion model performs favorably compared to individual CNN models.


I. INTRODUCTION
THE PANDEMIC of coronavirus disease (COVID-19) continues to have a significant impact on the health and well-being of the global population. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is known to cause this lethal disease. With COVID-19, efficient screening is crucial to preventing the virus from spreading, as well as to providing timely treatment and care for affected individuals. The most common approach to screening for COVID-19 is the reverse transcriptase-polymerase chain reaction (RT-PCR) test [1], which can identify SARS-CoV-2 RNA in swabs collected from inside the nose or mouth. PCR has long been the gold standard among highly accurate testing methods, but it is a time-demanding technique that is in relatively short supply.
Biomedical imaging is an effective tool for visualizing interior organs of the body and their diseases. From its early and simple use of chest X-ray (CXR) in the diagnosis of fractures, biomedical imaging has evolved into a plethora of potent techniques used not only in patient care but also in the research of biological structure and function [2]. It could aid in the early detection of infectious diseases such as COVID-19, leading to more effective treatments.
Initial investigations found that patients infected with COVID-19 exhibited abnormalities in their CXRs. Thus, radiography assessment is another potential diagnostic approach that has been used for COVID-19 diagnosis. Radiologists conduct and evaluate chest radiographic imagery (e.g., computed tomography (CT) or CXR imaging) to check for visual signs of SARS-CoV-2 infection [3]. With the widespread use of CXR imaging in modern healthcare systems, radiography examinations are a valuable complement to PCR testing. However, one of the most critical bottlenecks is the requirement for skilled professionals to assess the radiography images, because the visual markers are often subtle. As a result, computer-assisted diagnostic devices that can detect COVID-19 patients more quickly and precisely are widely sought after.
According to preliminary results, patients with COVID-19 infection can be accurately detected from radiographic imaging using deep-learning-based artificial intelligence (AI) systems [4]. However, given the status of the global health emergency, it is challenging to obtain a significant amount of carefully curated data for neural network training. With minimal training data, conventional approaches may suffer from model generalization issues, because deep learning algorithms often require a large quantity of training data. As such, this article's major goal is to design a deep neural network architecture that can produce clinically explainable findings even with insufficient training data. Moreover, many current techniques are "black-box" solutions that do not provide insight into the critical image properties. Where resources such as diagnostic tests or radiologists are limited, AI-aided solutions could greatly assist less experienced primary care doctors in triaging patients by emphasizing crucial lung areas.
Recently, the idea of using weighted fusion networks, which integrate multiple deep learning networks to produce better predictions, has attracted considerable attention [5], [6]. In this paradigm, we can integrate the strengths of several independent networks to attain promising outcomes. We have observed that the capability of deep-learning-based fusion models for effective COVID-19 detection is not adequately exploited in the literature. This study presents a deep learning fusion model with an attention mechanism that combines multiple convolutional neural network (CNN) models to capture key features from CXR images, which are then fused to achieve strong categorization of these images. The result is a unified multiheaded architecture designed to accurately classify unseen CXR images. Three fine-tuned CNN models, ResNet50V2, VGG16, and InceptionV3, are used to obtain feature encodings from the images, each of which is connected to an attention network. Additionally, we employ a locally connected (LC) layer with weighted contributions from these networks to extract the salient features employed in the final classification. CXR images from two open-source datasets are used to evaluate our fusion model. Our method can extract essential features from a limited dataset and provides a realistic strategy for identifying COVID-19 to enhance clinical outcomes. Finally, an explainability technique centered on class activation mapping is presented to emphasize the regions of CXR images most indicative of varied infections.
The following is a summary of our contributions in this article.
1) Automatic COVID-19 screening from CXR images using a multilevel CNN fusion model is presented.
2) Essential features are extracted by multiple CNN models, each connected to an attention network, and subsequently merged to improve diagnostic performance.
3) An LC layer is used to obtain a balanced mixture of the base CNN models.
4) The regions of CXR images most indicative of varied infections are depicted using an explainability technique.
5) To our knowledge, this is one of the first attention-guided deep-learning-based fusion models for early detection of COVID-19 from CXR images.
We structure the remaining sections of this article as follows. Section II includes related work. The approach, dataset, and experiments are presented in Sections III and IV, together with performance results and discussions. Finally, conclusions are drawn in Section V, which includes ideas for future research.

II. RELATED STUDIES
Recently, applied AI, more specifically deep learning, has dramatically impacted the field of biomedical imaging [7], [8]. There has been a dramatic increase in research and development efforts focused on using AI in biomedical imaging instrumentation for early disease detection and prognosis. One such application of AI techniques is identifying potential COVID-19 infection. In particular, the recent past has witnessed the investigation and analysis of radiography images using various AI approaches to detect the presence of COVID-19. To this end, researchers are focusing on developing statistical learning-based approaches to detect possible coronavirus infection in CT scans and CXR images. Some representative research efforts are summarized below.
Research in [9] presented a deep-learning-based technique for diagnosing COVID-19 infections from CT scans early in the outbreak. Zhou et al. [10] leveraged the UNet++ deep learning model with a pretrained ResNet-50 backbone network and used about 46 096 anonymized images from 106 patients to train and test the model. On an internal retrospective dataset, the model was 95.24% accurate per patient and 98.85% accurate per image. For internal prospective patients, the system's performance was comparable to that of a skilled radiologist.
Using lung CT scans, Xu et al. [11] reported a deep-learning-based model for discriminating coronavirus-infected cases from healthy and viral pneumonia (Influenza-A) cases. After segmenting images to find potentially infected areas, the proposed model was used to categorize the discovered regions. To train and test the model, they used 618 CT scans from three groups, healthy, COVID-19, and Influenza-A patients, collected from three separate COVID-19 facilities in China. On this carefully selected dataset, the model demonstrated a moderate degree of accuracy (86.7%). In another study [12], Wang et al. suggested a viable deep-learning-based diagnosis model that relies on graphical features in the input CT images. They gathered 1065 CT scans of verified COVID-19 cases, as well as cases of typical viral pneumonia. The algorithm was developed by modifying the Inception transfer-learning model, which was then tested internally and externally. Experimental results showed a moderate level of performance, with an accuracy of 79.3% on the external testing dataset.
In addition to CT scans, CXR images have been employed in various investigations to identify COVID-19. Given that CXR images are more readily available than CT images, particularly in rural regions, they may be a feasible replacement for CT scans in some situations. Wang et al. [13] proposed COVID-Net for diagnosing COVID-19 using CXR images. They used a residual architectural design in COVID-Net and leveraged a dataset consisting of COVID-19, normal, and other pneumonia-infected cases. A small number of COVID-19 images (fewer than 100) were used to train and test the model, compared to about 16 000 images of healthy individuals and other pneumonia cases; this imbalance is critical to keep in mind. Using CXR images and a deep learning method, Sethy and Behera [14] classified COVID-19-infected people. Support vector machines (SVMs) were used to classify features extracted by nine pretrained models. In comparison to the other models, ResNet50 with SVM showed superior performance in terms of accuracy (95.38%) and F1-score (95.52%).
Delft Imaging [15] researchers made another noteworthy contribution by establishing an AI model for the diagnosis of COVID-19 using CXR images. The proposed model was built on an existing AI prototype created for tuberculosis diagnosis and is used to identify patients who may have been exposed to COVID-19. Rahimzadeh and Attar [16] combined Xception and ResNet50V2 into a deep convolutional network to boost accuracy and suggested a technique for coping with unbalanced datasets. They evaluated the proposed network on 11 302 images to determine the accuracy (91.4%) achievable in real-world situations. Loey et al. [17] introduced a novel COVID-19 detection model based on a generative adversarial network (GAN) and transfer learning. The idea is to collect the currently accessible COVID-19 images and use the GAN to produce additional images so that COVID-19 can be identified with high accuracy. Using the proposed method, they classified healthy, COVID-19, viral pneumonia, and bacterial pneumonia cases with 80.6% accuracy. Additionally, some research initiatives [18] have aided the interpretation of their predictions by extracting crucial COVID-19 biomarkers to acquire a better knowledge of the disease.
Narin et al. [19] reported a transfer-learning-based method for classifying COVID-19 cases. For binary classification, they used a pretrained ResNet50 and obtained an accuracy of 98%. However, their curated collection contains just 50 COVID-19 images. Oh et al. [20] developed an alternative method to train and fine-tune the ResNet18 CNN model. They trained the model using patches taken from CXR images and obtained 88.9% accuracy using a majority voting technique. Ozturk et al. [21] suggested an object detection-based technique for COVID-19, using a DarkNet model to identify COVID-19 in CXR images. Their experiments demonstrated a high degree of binary classification accuracy (98.08%), but the model only achieved 87.02% accuracy for multiclass categorization.
Two main machine learning paradigms are evident from the previous studies utilizing radiology imaging for COVID-19 detection: 1) classification algorithms that use relevant features extracted from input CXR or CT images and 2) deep learning techniques that feed the images directly to neural networks. In general, the second strategy produces superior performance, with the majority of research employing fine-tuned or pretrained transfer learning models. Motivated by recent developments in fusion networks, we propose a model called "deep fusion," which takes advantage of blending multiple deep learning models to obtain better results.
Additionally, the majority of these techniques do not provide enough model explainability in relation to clinical symptoms relevant to the disease. Health practitioners are unlikely to use a black-box classification model, even if the experimental results are quite reliable. Hence, a gradient-based class activation mapping (Grad-CAM) approach was used in this study to display CXR areas indicative of COVID-19 infection.

III. METHODOLOGY
From the literature study, we observe that the potential of fusion technology has not been fully leveraged for the robust diagnosis of COVID-19 cases. We therefore propose a deep learning model that integrates various deep CNN models to extract important features from CXR images and fuses them to yield robust classification into normal, COVID-19, and pneumonia categories. In summary, our proposed fusion model is composed of three main components.
Feature Extraction: Three pretrained CNNs, ResNet50V2, VGG16, and InceptionV3, are used to extract features from the CXR images.
Attention Mechanism: Each CNN model is connected to an attention network that is used to weigh the features extracted by the CNNs. This allows the model to focus on the most important features for classification.
Classification: An LC layer is used to combine weighted contributions from these networks and extract salient features for final classification.
We start the section with the problem description from the perspective of a classification network-based deep CNN fusion model. Our system's components and technique for COVID-19 screening using CXR data are then described.

A. PROBLEM FORMULATION
There has been growing interest in deliberately training multiple deep learning networks so that their weighted combination predicts better outcomes than any individual network. Viewing deep learning models as a fusion representation in this way, we can combine the strengths of different networks to obtain better results. The end result is a single fused multiheaded network that is meant to work reliably for the classification of unseen CXR images. An illustration of our fusion problem is given in Fig. 1.
We are given a CXR image dataset $D = \{(x_n, y_n),\ n = 1, \ldots, N\}$, where $x_n$ represents the attribute values of the $n$th input image and $y_n$ denotes its class label. In addition, we consider $M$ base deep learning models, the $i$th of which is denoted by $F^{(i)}$, for $i = 1, \ldots, M$. For each image $x_n$ in $D$, let $z^{(i)} \in \mathbb{R}^{f_i}$ be the feature vector extracted by the $i$th base model. Now, given the combined or fused feature vector output from all the base models, $z^{(f)} \in \mathbb{R}^{f}$ with $f = \sum_{i=1}^{M} f_i$, the final classification results are obtained by another deep learning model, $F$.

B. PROPOSED SYSTEM
Our suggested approach for noninvasive COVID-19 case screening is depicted schematically in Fig. 2. To begin, three commonly utilized deep convolutional networks, InceptionV3 [22], ResNet50V2 [23], and VGG-16 [24], are used to extract features from input CXR images. Then, we combine the features derived from these networks using a weighted combination. Multilabel classification on the fused features is conducted by another convolution and a fully connected layer. Finally, model interpretation is demonstrated through a visualization technique. The next sections describe the system in depth.

1) EXTRACTION OF FEATURES USING MULTIPLE CNN MODELS
Base CNN models pretrained on the ImageNet dataset are employed in our approach to extract features from CXR images. Simonyan and Zisserman [24] created the VGG-16 network architecture, which uses small 3 × 3 convolutional filters stacked one on top of the other, with max pooling layers to lower the volume size. It contains 16 weight layers, including two FC layers with 4096 nodes each, followed by a softmax layer for classification. With the help of these smaller filters, it outperformed its pioneer AlexNet [25] in the ILSVRC classification challenge on the ImageNet dataset, which contains over 14 million images across more than 20 000 categories.

As the number of layers increases in a typical deep learning architecture, an issue known as the vanishing or exploding gradient occurs: the gradient either becomes zero or grows too large, and the error rate during training and testing increases. To address this issue, the ResNet architecture established the notion of the residual network, which uses a mechanism known as skip connections. A skip connection bypasses a few layers and links directly to the output; this is advantageous because regularization can effectively skip any layer that hurts performance. As a result, an extremely deep neural network can be trained without vanishing or exploding gradients. The version used here, ResNet50V2, includes 50 weight layers and adopts a redesigned (pre-activation) residual unit that improves training.

Finally, we used the Inception model [22] in our fusion network, which consists of a repeating set of components called Inception modules. Convolution layers with different filter sizes and a 3 × 3 max pooling layer are arranged in a block, and their results are concatenated to form the module's output. The network was originally known as GoogLeNet, but its succeeding versions are simply referred to as Inception. A more recent version, Inception V3 [26], was employed in this study to further boost classification performance; it is known for having fewer parameters than the other two networks.
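To make the feature extraction step concrete, the following minimal tf.keras sketch shows how the three ImageNet-pretrained backbones could be instantiated as headless feature extractors. The input size and the layer-freezing policy are assumptions, since the article does not specify them.

```python
from tensorflow.keras import applications

IMG_SHAPE = (224, 224, 3)  # assumed input size; the article does not state it

def build_backbone(app_cls):
    """Instantiate an ImageNet-pretrained CNN without its classifier head."""
    base = app_cls(weights="imagenet", include_top=False, input_shape=IMG_SHAPE)
    # Fine-tuning policy is an assumption: freeze all but the last 20 layers.
    for layer in base.layers[:-20]:
        layer.trainable = False
    return base

backbones = [build_backbone(applications.ResNet50V2),
             build_backbone(applications.VGG16),
             build_backbone(applications.InceptionV3)]
```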

2) ATTENTION NETWORK
We use the base CNN models in our fusion network, each coupled with a visual attention mechanism that helps the CNN focus on relevant parts of the input image instead of paying attention to the complete image.
Inspired by the work of Jetley et al. [27], who used a soft attention module in a CNN architecture to improve multilevel classification accuracy, we add an attention layer after the final FC layer of each base CNN model, as illustrated in Fig. 2. Batch normalization (BN) regulates the feature input from the base model to ensure a stable learning process and significantly lower training duration. Then, 1 × 1 convolutions are used to reduce the number of feature maps. The global average pooling (GAP) layer, which appears fairly simple, is modified to emphasize the areas of relevance, since certain areas are much more relevant than others. For this purpose, we create an attention component that modifies the pixel values in the GAP layer prior to pooling. More specifically, the attention mechanism produces a map of weights, where each weight represents the importance of the corresponding pixel. GAP is then applied, but only the pixels with the highest weights are considered. The GAP results are then rescaled by dividing them (using a Lambda layer) by the number of pixels considered. This ensures that the attention mechanism's influence is appropriately scaled, given the varying number of activated pixels; it also helps maintain consistency in the model's response across different input sizes and allows for better interpretability and comparison of the attention weights. Thus, the model can be considered a weighted variant of GAP. It can improve on plain GAP by focusing on the most important pixels in the image and can help prevent overfitting to the training data. Fig. 3 depicts the attention architecture in greater detail.
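The following is a minimal sketch of the attention-weighted GAP block described above, assuming a sigmoid attention map and a 1 × 1 channel reduction; the exact layer sizes are not given in the article.

```python
from tensorflow.keras import layers

def attention_gap(features, reduced_maps=64):
    """Attention-weighted global average pooling (a sketch of Fig. 3).

    `features` is a 4-D backbone output (batch, H, W, C); `reduced_maps`
    is an assumed channel count for the 1 x 1 convolutional reduction.
    """
    x = layers.BatchNormalization()(features)                 # stabilize inputs
    x = layers.Conv2D(reduced_maps, 1, activation="relu")(x)  # reduce feature maps
    # One attention weight per spatial position, in [0, 1].
    attn = layers.Conv2D(1, 1, activation="sigmoid")(x)
    weighted = layers.Multiply()([x, attn])                   # emphasize relevant pixels
    gap = layers.GlobalAveragePooling2D()(weighted)
    attn_mass = layers.GlobalAveragePooling2D()(attn)
    # Rescale by the attention mass (Lambda layer) so the result behaves
    # like a weighted average over the attended pixels.
    return layers.Lambda(lambda t: t[0] / (t[1] + 1e-7))([gap, attn_mass])
```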

3) WEIGHTED FEATURE FUSION
As shown in Fig. 2, we combine the feature vectors collected from the base CNN models. This process of combining features can be thought of as a joint representation of the feature vectors, commonly recognized as "early fusion." The joint feature representation is then passed through a sequence of convolution and dense layers before the final classification. Therefore, our fusion model is trainable from beginning to end: it can both learn feature representations and carry out multilabel classification. Images can be represented in a more accurate and detailed manner when CNN features are fused together. Our method uses weighted feature fusion, inspired by the technique described in [28] for diabetic retinopathy detection.
For the fusion of the different CNN models, our method uses an LC layer. LC layers are comparable in many respects to convolutional layers, but without shared weights: whereas convolutional layers use the same filter weights across all pixel positions, LC layers use nonsharing filters to learn unique weights for each local field [29]. By learning adjustable weights for the diverse base CNN models, LC layers yield an improved fusion encoding of features. Fig. 4 compares the LC layer with two other commonly used fusion techniques, namely, fully connected fusion without learned weights and, as previously noted, convolutional fusion with weight sharing.
There are three base CNN models, each with an attention layer, in the fusion process. In the first step, we combine all of the GAP layer features from these base models into a single GAP layer called $G_{fusion}$ with dimension $1 \times G \times M$. We write $z^{(i)}$ for the $i$th feature map of $G_{fusion}$, where $1 \le i \le M$. In the next step, an LC layer containing $G$ filters with nonshared weights, each of dimension $1 \times 1 \times M$, is convolved across $G_{fusion}$. Consequently, the LC layer can produce an improved feature encoding by learning adjustable weights for the distinct feature maps of $G_{fusion}$, which indicate the significance of each underlying CNN model. Finally, we obtain a 1-D feature vector of size $G$, termed $z^{(f)}$, whose $n$th component may be represented as

$$z_n^{(f)} = \varphi\left(\sum_{i=1}^{M} w_{n,i}^{(f)}\, z_n^{(i)} + b_n^{(f)}\right)$$

where $\varphi$ denotes the activation function, and $w_{n,i}^{(f)}$ and $b_n^{(f)}$ are the weights and bias used to combine the $n$th element of the features from the various underlying CNN models. Therefore, we achieve weighted feature fusion without manual tuning, at the cost of $G(M+1)$ additional LC-layer parameters.
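A minimal sketch of this LC-layer fusion in tf.keras follows, assuming an equal GAP dimension G across models (e.g., via the 1 × 1 reductions in the attention blocks); LocallyConnected1D with kernel size 1 reproduces the G(M + 1)-parameter weighting described above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def lc_fusion(gap_features):
    """Weighted fusion of M attention-GAP vectors via a locally connected layer.

    `gap_features` is a list of M tensors of shape (batch, G). Stacking them
    gives G_fusion of shape (batch, G, M); LocallyConnected1D with kernel
    size 1 then learns an unshared 1 x 1 x M filter plus bias per position,
    i.e., G(M + 1) parameters, matching the formulation above.
    """
    g_fusion = layers.Lambda(lambda ts: tf.stack(ts, axis=-1))(gap_features)
    # Note: LocallyConnected1D exists in tf.keras (TF 2.x) but was removed
    # in Keras 3; treat this as a sketch, not the authors' exact code.
    fused = layers.LocallyConnected1D(filters=1, kernel_size=1,
                                      activation="relu")(g_fusion)
    return layers.Flatten()(fused)  # 1-D fused feature vector of size G
```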
For model training, we use the Adam optimizer with mini-batches, as is standard practice. To summarize, our fusion network functions as a multiheaded model that receives the same image data as input. Intermediate feature vectors created by the base models coupled with attention networks are fused and fed through an LC layer for the weighted contribution of features from the distinct base CNN models. The whole fusion process is summarized in Algorithm 1.

IV. EXPERIMENTS AND RESULTS ANALYSIS
In this section, we evaluate our fusion model's effectiveness for COVID-19 detection using a curated CXR dataset and compare it with the base CNN models.

Algorithm 1 Fusion of CNN Models in Classification of CXR Images
Input: Training data T = {x_i, y_i}, 1 ≤ i ≤ N; testing data T_test; base CNN models
Output: Prediction outcomes from the fusion model
T_train, T_validation ← split dataset T
for k = 1 to K (folds) do
    Generate feature vectors from the base CNN models (M) coupled with attention networks:
    for m = 1 to M do
        Extract feature vectors z^(m) based on T_train
    end for
    z^(f) ← weighted concatenation([z^(1), z^(2), ..., z^(M)]) with fusion weights w_1, w_2, ..., w_M
    Construct a new dataset G_fusion containing the weighted features and target labels
    F^(k) ← final classifier learned from the newly created dataset G_fusion
    Validate F^(k) with T_validation
end for
Classify test data: outcomes ← classify(F, T_test)
return outcomes

A. EXPERIMENTAL DATA
We use COVID-19 CXR images from a publicly available dataset [30] containing 616 instances. In addition, we gather 1232 CXR images of healthy individuals and pneumonia patients from an open-access Kaggle repository [31]. Our final curated dataset consists of 1848 CXR images from the COVID-19, healthy, and pneumonia categories, each with an equal count. The whole dataset is divided into three sets, 1) training, 2) validation, and 3) test, with a 60:20:20 ratio. The distribution of images in each class for all three sets is shown in Table 1. We split the dataset into five folds; model training takes place on the training and validation sets (four folds), while the remaining hold-out test set is used to assess model performance. Images from the patients used for training and validation were not included in the test set, which prevents data leakage and ensures that evaluation reflects the model's ability to generalize and make accurate predictions on unseen images from different patients.
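As an illustration of such a patient-disjoint split, the following sketch uses scikit-learn's GroupShuffleSplit; the dataframe columns (path, label, patient_id) are hypothetical, since the article does not describe its data structures.

```python
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(df, seed=42):
    """60:20:20 split keeping all images of a patient in a single set.

    `df` is a hypothetical dataframe with one row per CXR image and
    assumed columns `path`, `label`, and `patient_id`.
    """
    gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=seed)
    trainval_idx, test_idx = next(gss.split(df, groups=df["patient_id"]))
    trainval, test = df.iloc[trainval_idx], df.iloc[test_idx]
    # 25% of the remaining 80% yields the 20% validation share.
    gss_val = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=seed)
    train_idx, val_idx = next(gss_val.split(trainval, groups=trainval["patient_id"]))
    return trainval.iloc[train_idx], trainval.iloc[val_idx], test
```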

1) PREPROCESSING
The CXR images in the collection were obtained by a diverse set of imaging procedures. Despite this, we refrain from performing heavy preprocessing on the CXR images. Finally, we apply image augmentation to address the small size of the dataset and enhance training efficiency without overfitting the models. Table 2 summarizes the augmentation settings used to prepare the training dataset. We train and test models in the Google Colab notebook environment, which gives us free access to GPUs and is equipped with a TensorFlow backend and the Keras API. A dense layer consisting of 256 neurons with the ReLU activation function is added to the fusion model after the locally connected layer. In the final step, a dense layer containing a softmax unit is added to provide classification scores. The Adam optimizer is used to train and optimize the models, starting with a learning rate of 0.001. In addition, we make use of a Keras callback known as ModelCheckpoint to monitor performance metrics and save the model at regular intervals according to monitoring conditions. In this study, we use standard performance metrics, including sensitivity (recall), accuracy, precision, specificity, and area under the curve (AUC).
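A hedged sketch of this training configuration in tf.keras is given below. The augmentation values are placeholders (the actual settings are in Table 2), and `fused_features` and `image_inputs` are assumed to come from the fusion sketches in Section III.

```python
import tensorflow as tf
from tensorflow.keras import layers, optimizers, callbacks
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation values below are placeholders; the actual settings are in Table 2.
train_gen = ImageDataGenerator(rescale=1.0 / 255,
                               rotation_range=10,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               zoom_range=0.1,
                               horizontal_flip=True)

# Classification head after the locally connected fusion layer;
# `fused_features` and `image_inputs` are assumed from the Section III sketches.
x = layers.Dense(256, activation="relu")(fused_features)
outputs = layers.Dense(3, activation="softmax")(x)  # COVID-19 / normal / pneumonia

model = tf.keras.Model(inputs=image_inputs, outputs=outputs)
model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])

checkpoint = callbacks.ModelCheckpoint("fusion_best.h5", monitor="val_accuracy",
                                       save_best_only=True)
```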

B. PERFORMANCE RESULTS AND DISCUSSION
On Google Colab, we train the model for about an hour and fifteen minutes across 50 epochs using a fourfold cross-validation technique; thus, each fold takes around 19 min to complete training and validation. Using the hold-out test dataset, we evaluate the performance of the base CNN models and our fusion model. On average, our fusion model exhibits an inference time of around 650 ms, i.e., the duration it takes to process an input image and produce the classification result. In medical situations, including the detection of COVID-19, prompt and accurate results are essential for effective decision making and patient care; this inference time enables fast processing of individual images, thereby accelerating diagnosis and potential treatment planning. Table 3 shows the overall performance results for all studied models, and the performance of each individual class is assessed on the same metrics (shown in Table 4). The fusion model consistently outperforms the base models on performance criteria such as AUC, accuracy, specificity, and precision. On the hold-out test dataset, which provides a crucial performance estimate for medical applications, the fusion model yields remarkable accuracy (96.75%) and specificity (99.15%). This clearly demonstrates the advantage of utilizing a weighted mixture of features collected from several base models in a fusion network for categorizing COVID-19 patients over other current techniques. Our model's specificity results show that it can distinguish 99.15% of all negative COVID-19 instances from positive ones. Among the base models, ResNet50V2 surpasses the other CNN models with respect to accuracy, while InceptionV3 demonstrates a slightly higher sensitivity (92.6%) than the other base models.
The class-specific results (in Table 4) also clarify the effect of model fusion in comparison to each CNN model. The individual CNN networks tend to perform relatively poorly when categorizing pneumonia images, although they perform moderately well when detecting healthy patients. ResNet50V2 shows improved accuracy and sensitivity when classifying COVID-19-positive individuals. Interestingly, the fusion model harnesses the benefit of a weighted mixture of the basic CNN models, keeping ResNet50V2's strength to compensate for the deficiencies of VGG-16 and InceptionV3 and improving all metrics for all three classes.
This result is noteworthy, since reliably identifying CXR images of all three categories is crucial for an effective diagnostic tool. During training, all models show a steady learning tendency, with training and validation losses decreasing consistently. Moreover, the fusion model's training and validation (in Fig. 5) converge better within the same number of training epochs as the other models. Even though the curated dataset contains limited samples, the learning curves show that the models are not prone to overfitting. This is essentially attributed to the fusion model's generalizability, the data augmentation applied to the training set, and the use of regularization techniques such as dropout in the fusion model. Fig. 6 presents each model's receiver operating characteristic (ROC) curve, which provides more in-depth knowledge of the performance of the tested models. An ROC curve is constructed by plotting the true positive rate (TPR) versus the false positive rate (FPR). The ROC curves (shown in Fig. 6) demonstrate the fusion model's stability, as the area under the curve is quite comparable across all classes, and the model achieves an average AUC score of 0.9723 (as given in Table 3). The other base CNN models classify the non-COVID classes poorly, whereas they discriminate the COVID-19 class well, as shown by its considerably higher curve. Based on our analysis of the confusion matrix shown in Table 5, the fusion model generates far fewer false positive (FP) and false negative (FN) cases for the COVID-19 class (2 and 10, respectively) than for the other classes (e.g., 10 FP and FN cases each for the normal and pneumonia classes) and than the base models. This is vital for minimizing wrong diagnoses: a low FP count means that fewer cases are misdiagnosed as COVID-19 positive, which leads to higher accuracy and specificity scores.
To avoid unnecessary financial burdens on healthcare providers, it is critical to decrease the number of FP cases. Similarly, a decreased number of FN cases means that fewer COVID-19 cases are missed, which increases sensitivity. Keeping the number of FN cases low is essential so that the model does not misclassify an infected individual as healthy, which would impede the patient's ability to receive effective treatment. Overall, the fusion model presents a balanced classification performance by reducing FP and FN counts for all classes of images in the dataset. Based on these results, the proposed fusion model appears to be the best performing among all evaluated models.
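For reference, per-class sensitivity and specificity follow directly from the confusion-matrix counts discussed here; a small sketch (using scikit-learn, an implementation assumption) is shown below.

```python
from sklearn.metrics import confusion_matrix

def per_class_metrics(y_true, y_pred, n_classes=3):
    """Per-class sensitivity and specificity from a multiclass confusion matrix."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    results = {}
    for c in range(n_classes):
        tp = cm[c, c]
        fn = cm[c, :].sum() - tp   # true class c, predicted otherwise
        fp = cm[:, c].sum() - tp   # predicted c, true class otherwise
        tn = cm.sum() - tp - fn - fp
        results[c] = {"sensitivity": tp / (tp + fn),
                      "specificity": tn / (tn + fp)}
    return results
```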
Although the dataset utilized in this work to evaluate the proposed approach is balanced, it is critical to consider how well the technique performs on imbalanced datasets, which are more representative of real-world settings. The effectiveness of the proposed model can be impacted by an imbalanced dataset, where one class is significantly more prevalent than the others. This is due to the model's potential bias toward the majority class and its inability to accurately describe and classify minority classes. However, the challenges caused by imbalanced datasets may be mitigated due to the addition of an attention mechanism to our fusion model. The attention mechanism may enhance the identification and representation of the minority class by enabling the model to concentrate on the most relevant regions, resulting in improved overall performance on imbalanced datasets.
In addition, there are a number of techniques that can be used to address class imbalance issues, such as oversampling or undersampling in conjunction with the proposed fusion model.

C. MODEL INTERPRETATION
As part of the qualitative study, we examine the models' decisions regarding COVID-19. More specifically, it is very important to understand how the models are being trained and validated. Class activation maps (CAMs) [32] were developed as a way to show which parts of an image the CNN uses to identify the output class when making predictions. This is accomplished by projecting the weights of the output layer back onto the final convolutional feature maps. With Grad-CAM [33], a generalized version of CAM, we can create a localization map that highlights the critical areas of an image for accurate prediction. Fig. 7 displays a Grad-CAM heatmap of a COVID-19 patient's CXR image. The original image (with emphasis areas denoted by red arrows) is shown alongside a heatmap emphasizing the lung's most critical locations, as well as a fusion of the heatmap with the original image. Analyzing the areas of concentration can provide valuable insights into how the studied deep learning models made decisions and predictions based on visual information. The different regions of concentration that the models focused on signify the different features they used to make their predictions. More specifically, Grad-CAM highlights localized regions of the image where the model finds distinctive features relevant to the target class. In the particular example of Fig. 7, the fusion model and VGG-16 concentrate on both the left and right sides of the respiratory tract. Although both models gave accurate predictions for the image, ResNet50V2 and InceptionV3 appear to concentrate on comparatively wider areas, including the lower respiratory tract. A skilled radiologist must perform thorough clinical testing to confirm this. The ultimate goal is to ensure that our model makes correct predictions based on the data it receives from the images.
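For illustration, a minimal Grad-CAM sketch for a single-input tf.keras model is given below; the choice of convolutional layer is an assumption that depends on the backbone, and this is not the authors' exact implementation.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_idx, conv_layer_name):
    """Minimal Grad-CAM [33]: gradient-weighted average of conv feature maps.

    `conv_layer_name` (the last convolutional layer of interest) is an
    assumption that depends on the specific backbone used.
    """
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(conv_layer_name).output,
                                 model.output])
    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image[np.newaxis])
        class_score = preds[:, class_idx]
    grads = tape.gradient(class_score, conv_maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))       # GAP of the gradients
    cam = tf.einsum("bhwc,bc->bhw", conv_maps, weights)
    cam = tf.nn.relu(cam)[0]                           # keep positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy() # normalized heatmap
```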

D. DISCUSSION
In summary, we have shown (in Table 6) how our suggested fusion model stacks up against some of the most current and widely used methodologies in the field. It is important to point out that the COVID-19 portion of the dataset used in this investigation is relatively restricted; some earlier work (e.g., [34] and [35]) trained models with even fewer than 100 COVID-19 images. COVID-Net, proposed by Wang et al. [13], is one of the first attempts to find coronavirus infection using CXR images; they suggested a customized deep learning algorithm for predicting COVID-19 cases. However, the number of COVID-19-positive images in their training dataset is very limited (less than 100) compared to the normal and non-COVID-19 images. This makes their dataset rather imbalanced, which may adversely affect the performance of the model. Furthermore, the parameter count of COVID-Net (111.6 million) is almost double that of our proposed fusion model, which contains about 62 million parameters.
Thus, our fusion model offers substantial computational savings in addition to the performance gain reported in Table 6. A subsequent effort was made by Ucar and Korkmaz [36], who used a SqueezeNet CNN model for multilabel classification of CXR images. They reported an accuracy of 95.7% on their dataset, comparable to the state of the art, but the model appears to produce more FN cases, which lowers its sensitivity score (90%). Overall, the proposed fusion model shows superior performance compared to state-of-the-art methods while using a relatively higher number of COVID-19 samples (616) for model training. On the basis of the work done to date on deep learning models for automatically detecting COVID-19 infections, we can recognize the role of AI in supporting radiologists with the accurate and rapid diagnosis of possible coronavirus infections. This model is not a replacement for human radiologists; rather, it will help bring more attention to, and use of, AI-powered solutions in clinical settings. Although the results of CXR images alone are not enough to decide how to treat a patient, an early diagnosis can help doctors keep likely positive cases isolated until a more thorough test is ordered. Moreover, while alternative diagnostic methods, including antigenic swab tests, are more widely used and easily adaptable for mass screening, several advantages of X-ray imaging should be highlighted: widespread availability, rapid processing time, low cost, and noninvasive nature.

V. CONCLUSION
In this study, we provide a novel CNN fusion model for the noninvasive diagnosis of COVID-19 cases based on biomedical imaging instrumentation. To help address the COVID-19 pandemic, this research presents a cutting-edge AI-based solution for an efficient and quick method of diagnosing COVID-19 infections. The suggested method achieves this by creating a CNN fusion model with an attention mechanism, composed of a weighted combination of multiple CNN base models. For COVID-19 classification, we use pretrained ResNet50V2, VGG-16, and InceptionV3 models that have been fine-tuned with an attention network to extract important features that are resistant to overfitting, and we exploit an LC layer for the weighted fusion. The experimental findings show that the proposed fusion model with the attention mechanism provides a solution that is both highly accurate (96.75%) and practically applicable, with interpretation results suitable for use in clinical practice to expedite patients' course of therapy. Our model's high AUC (0.960) also indicates its ability to distinguish between COVID-19, healthy, and non-COVID pneumonia cases. In the future, we intend to improve prediction performance by employing semantic segmentation techniques. Finally, more attention should be given to the applicability of the proposed fusion model in the real world. To achieve this goal, we need to concentrate on gathering a more representative dataset, testing the model on a different test dataset, and improving the model to better handle noise. By addressing these concerns, the proposed model could become a useful diagnostic tool for COVID-19 in the real world.