Volumetric Model Genesis in Medical Domain for the Analysis of Multimodality 2-D/3-D Data Based on the Aggregation of Multilevel Features

The automatic and accurate classification of medical imaging data has potential applications in computer-aided disease diagnosis, prognosis, and treatment. However, it remains a challenge to optimize recent deep learning algorithms in the medical domain for the accurate classification of large-scale three-dimensional (3-D) volumetric data. To address this challenge, we propose an efficient deep volumetric classification network based on the aggregation of multilevel deep features for the accurate classification of large-scale medical 2-D/3-D imaging data. To perform a detailed quantitative analysis of our method, 26 different datasets were fused to construct a single large-scale multimodal database that comprises a total of 70 different classes, including 151,095 data samples. Additionally, 15 different baseline methods were configured under the same experimental protocol for volumetric model genesis and extensive performance comparison with our method. In the experiments, our method achieved an area under the curve of 93.66% and outperformed various state-of-the-art methods.


I. INTRODUCTION
WITH the development of digital devices, the use of different types of imaging modalities [such as magnetic resonance imaging (MRI), X-rays, optical projection tomography (OPT), ultrasonography, computed tomography (CT), angiography, positron emission tomography (PET), and visible light cameras] has become commonplace in the medical diagnostic domain. These imaging modalities provide diagnostic assistance to medical experts by capturing the visual representation of different body organs as 2-D/3-D imaging data [1], [2]. Consequently, the production of multimodal 2-D/3-D imaging data has grown exponentially in recent years. In addition, the application of multimodal data in various medical diagnosis areas is also increasing rapidly. For example, multimodal images such as CT and MRI images are being fused to create a single mark image that can be more suitable for diagnostic evaluation than individual images [3], [4]. Recently, a variety of multimodal fusion-based algorithms are evolving for safe and secure telehealth applications [5]. Therefore, effective organization and analysis of existing multimodal data can offer various potential applications in the medical domain. For example, medical professionals can obtain a diagnostic clue for a complex medical condition by retrieving relevant cases from the existing database using efficient classification algorithms. Consequently, an accurate and timely diagnosis of acute medical conditions results in better treatment [1], [2].
However, subjective exploration, classification, and retrieval of intended content from a huge collection of visual data are challenging and time-consuming tasks. Recently, advancements in the artificial intelligence (AI) domain have provided various potential applications in general, as well as in the medical field [6], [7]. Efficient analysis of medical imaging data is also one of the key applications of AI algorithms. Consequently, various state-of-the-art computer-aided diagnosis (CAD) methods have been proposed in the literature that utilize the power of AI in medical data analysis and enable effective diagnostic decisions [8], [9], [10], [11]. Among the different AI methods, a subset of deep learning (DL) algorithms has received special attention owing to its remarkable performance, particularly in the case of visual data analysis [12], [13], [14]. Such DL-driven CAD methods mimic the processing of the human brain and deliver accurate diagnostic results, similar to those of medical experts. With respect to image- and sequence-based CAD methods, convolutional neural networks (CNNs), a well-known variant of DL algorithms, have received special consideration. Various types of CNNs [15], [16], [17], [18] have been proposed in the literature for general and medical applications. The structure of a CNN model mainly comprises multiple convolutional layers and fully connected (FC) layers, including trainable parameters [9]. Initially, these parameters are trained using an independent training dataset. Subsequently, the trained model classifies a testing data sample into its target class after analyzing it through multiple convolutional and FC layers.

A. Potential Research Gaps and Motivation
Most of the existing methods [18], [19], [20], [21] are disease- and modality-specific, and optimized for a limited number of data samples. Moreover, these methods are designed to make diagnostic decisions based on 2-D imaging data employing image-based classification models [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], even in the case of 3-D imaging data. There is very limited research related to the joint classification of multimodal 2-D/3-D imaging data considering a large number of classes. The main objective of this study is to encapsulate the computer-aided diagnostic capability of various kinds of diseases in a single DL model that can be scaled up in future work by including more data and classes. In addition, we aim to provide new grounds for developing an efficient jointly connected content-based medical image and sequence retrieval (CBMISR) framework by applying our proposed classification model. Various existing medical retrieval methods in the literature are image-based and consider limited classes and data samples to validate their proposed models. Therefore, we further highlight the application of our proposed model in developing a jointly connected 2-D/3-D imaging retrieval framework. An efficient CBMISR framework can assist medical professionals in validating their diagnostic decision for a complex medical condition by retrieving relevant cases from the existing database. Finally, we aim to encapsulate the diverse features of large-scale medical 2-D/3-D imaging data in a single model that can provide new grounds for future research related to medical domain-specific transfer learning (MDS-TL). Based on the proposed framework, the strengths of MDS-TL can be further explored and additional performance improvements can be achieved in various medical diagnostic applications.

B. Main Contributions
We mainly propose a novel jointly connected classification framework based on a multiscale dilated fused (DF) residual network (MDF-RN) and a spatiotemporal block classification network (STB-CN) for the classification of both medical 2-D/3-D imaging data. This is the first study to present a pretrained classification model in the medical domain, which is trained with a large-scale multimodal database that includes both 2-D/3-D imaging data. The main contributions of this study are as follows.
1) The main contribution is the development of a novel 2-D-CNN architecture (named MDF-RN) that leverages multiscale dilated convolution and a concept of multilevel feature fusion in a mutually beneficial manner to achieve state-of-the-art performance.
2) Three additional branches are created in the proposed MDF-RN model by including three DF-blocks that primarily exploit multiscale/multilevel features and enhance the overall performance.
3) Subsequently, the second-stage STB-CN model further utilizes the strength of recurrent neural networks (RNNs) and transfer learning in classifying 3-D imaging data without influencing the overall training parameters of the whole pipeline (MDF-RN+STB-CN).
4) The proposed STB-CN model works for both 2-D and 3-D imaging data and does not limit the processing of fixed-length sequences as restricted by 3-D-CNNs, but can classify variable-length sequences of successive slices/frames.
5) In addition, we evaluated the performance of fifteen state-of-the-art image-based and sequence-based classification models to provide standard benchmarks for this study.
Finally, our proposed classification framework (including implementations of both MDF-RN and STB-CN models) is freely accessible to enable fair comparisons and future research.
The rest of this article is organized as follows: Section II presents a brief literature review related to DL-driven CAD methods. Section III provides a detailed explanation of the proposed methodology. In Section IV, we briefly describe the datasets, experimental protocols, and results. Finally, the results are discussed and conclusions are drawn in Sections V and VI, respectively.

A. Image-Based Methods (2-D Models)
In the context of 2-D imaging data, Adnan et al. [7] proposed a classification-based medical image retrieval framework using a revised version of AlexNet [21] that can classify multimodal (CT, MRI, PET, OPT, and fundus camera) 2-D imaging data into one of 24 different classes. In another study, Falconi et al. [8] utilized the strength of transfer learning in breast lesion classification tasks using different pretrained CNN models. Three CNN architectures were modified and trained to classify mammogram images into one of six classes. Among the existing models, VGG19 [23] achieved superior results. Later, Owais et al. [9] addressed the limitations of [7] and proposed a new content-based medical image retrieval framework based on a modified version of ResNet50 [24], which was trained to classify multimodal 2-D imaging data into one of 50 different categories, including both disease and normal cases.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Apart from [7] and [9], most of the existing CNN-based CAD methods are domain-specific and perform binary classification (either diseased or normal). For example, Kaur et al. [10] proposed a CAD method using a pretrained VGG16 model [23] with the capability to categorize pathological brain images as normal or abnormal. However, a limited number of data samples (20 normal and 140 abnormal MRI images) were used to validate the proposed method. Subsequently, Ashraf et al. [11] used another pretrained CNN named GoogleNet [25] for medical image classification. A multimodal dataset (including 3600 images related to 12 different categories) was used to train and validate the method. Their method also includes a limited number of data samples (300 images per class). Similar to [10], Akpinar et al. [12] proposed a binary-classification CAD method for detecting chest abnormalities. An existing pretrained SqueezeNet [26] model was employed to categorize X-ray images into normal or abnormal groups. In total, 660 X-ray images were used to validate the method. Subsequently, Aloyayri et al. [14] utilized the strength of transfer learning in breast cancer classification using histopathological images. Three different CNN architectures were trained to classify data samples as either benign or malignant. Among the different baseline models, ResNet18 [24] achieved superior results.
Souid et al. [15] proposed a multiclass diagnostic framework for chest lesions. A lightweight deep CNN model, named MobileNetV2 [27], was trained to predict multiclass lung pathologies (considering 14 different classes) from chest X-ray images. A single-modality large-scale dataset, including a total of 64699 images, was used to calculate the performance of the CAD method. Similarly, Jasil et al. [16] and Çakmak et al. [17] utilized different CNNs for skin lesion classification tasks. In [16], a pretrained CNN, named DenseNet201 [28], was employed to classify dermoscopy images into one of seven different classes of skin lesions. A single-modality limited dataset, including a total of 3091 images, was used to validate the method. Later, Çakmak et al. [17] used a lightweight CNN, named NASNet-Mobile [29], for melanoma detection from dermoscopy images. They considered a larger dataset (including a total of 10015 images) than that of [16]. In the context of diabetic retinopathy (DR), Gambhir et al. [18] proposed a severity classification CAD method that was able to detect and distinguish DR into different severity levels. An existing ShuffleNet [30] model was trained to categorize the input DR image into one of five different classes (including one normal and four diseased cases).

B. Sequence-Based Methods (3-D Models)
There is very limited research related to the classification of large-scale multimodal 3-D imaging data for clinical decision support systems. For example, Shahzadi et al. [19], Srinivasu et al. [20], and Ebrahimi et al. [21] proposed sequence-based classification methods using existing CNNs and long short-term memory (LSTM) models [31]. Shahzadi et al. [19] proposed a binary-classification framework (comprising VGG16 [23] and LSTM [31] models) for the recognition of brain tumors from 3-D brain MRI scans. Subsequently, a skin lesion classification framework based on the MobileNetV2 [27] and LSTM [31] models was proposed by Srinivasu et al. [20]. Rather than using a single image, a sequence of images was used for disease classification. Similar to [19], another binary-classification framework for Alzheimer's disease detection was proposed in [21]. A cascade of ResNet18 [24] and LSTM models [31] was configured using 3-D brain MRI scans. Ebrahimi et al. [21] used a larger dataset (compared to [19]), including a total of 35550 MRI samples.

C. Limitations of the Existing Methods
The concept of joint multiscale and multilevel feature fusion has received little attention in medical 2-D/3-D imaging data classification. Different fusion techniques, such as early fusion, late fusion, and ensemble learning [6], exist and have improved DL performance. However, they require additional pre- and postprocessing overhead and increase the overall computational cost of a DL model. In a recent study, Abdar et al. [6] explored the strength of multilevel feature fusion by employing the concept of conventional ensemble modeling and proposed an image-based classification model. However, their proposed feature extractor scheme consists of a total of four different pretrained models, containing a total of 162 million trainable parameters and requiring extensive computational overhead. In addition, most existing studies related to medical data analysis are disease-specific and consider a limited number of classes, as well as data samples, to develop and validate their proposed classification methods. Moreover, various methods employed image-based models that consider only spatial information in making diagnostic decisions in the case of 3-D imaging data such as CT or MRI scans. Consequently, the loss of 3-D anatomical information may result in false predictions and ultimately lower the overall prediction probability for the testing data.

D. Singularity of Our Method
To address the limitations of existing studies, we propose a domain-specific pretrained model related to medical diagnostic applications using large-scale, multiclass, and multimodal 2-D/3-D imaging data. This is the first study to present a pretrained classification model in the medical domain including both 2-D/3-D imaging data. In total, 26 publicly available datasets (based on 11 different modalities) [9], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41] were fused to construct a single large-scale database comprising 70 different classes, including 151,095 data samples. The proposed CAD solution utilizes the strength of multiscale/multilevel feature fusion and encapsulates the computer-aided diagnostic capability of various kinds of diseases in a single DL model. Our proposed model leverages transfer learning in classifying 3-D imaging data without influencing the overall training parameters and works for both 2-D and 3-D

A. Workflow Overview
This study aims to develop a deep classification model with the capability to classify multiclass medical data, including both 2-D/3-D imaging data. In particular, the proposed method can classify a variable-length sequence of n successive slices/frames (i.e., $I_1, I_2, I_3, \ldots, I_n$, $F_1, F_2, F_3, \ldots, F_n$) with significant performance gain compared to image-based models. After selecting appropriate datasets, we developed a cascade of two classification networks, MDF-RN and STB-CN, for the accurate classification of multimodal 2-D/3-D imaging data. The overall procedure of the proposed model development mainly includes a training phase followed by a testing phase as shown in Fig. 1. Both networks were trained, validated, and tested using independent training, validation, and testing datasets. In the first step, an untrained MDF-RN model was trained to exploit and learn the spatial features from the training dataset that included a total of p data samples and corresponding class labels notated as $[F_T]_{i=1}^{p}, [l_T]_{i=1}^{p}$. In the next step, all training data samples $[F_T]_{i=1}^{p}$ were converted into feature vectors $[f_T]_{i=1}^{p}$ after processing each data sample through our trained MDF-RN model. Consequently, we obtained a new training dataset (the feature vectors $[f_T]_{i=1}^{p}$ with the corresponding class labels $[l_T]_{i=1}^{p}$) in the feature domain. In the next step, the second untrained STB-CN model was trained to learn the 3-D anatomical dependencies (in the case of 3-D imaging data) from this feature-domain dataset. We divided the training data into 2-D and 3-D imaging data according to the given information in each class label. In detail, all the classes with 2-D imaging data are notated with "I" along with the name of their actual class labels as shown in Fig. 1. Similarly, 3-D imaging classes are differentiated with "V" along with the name of their actual class labels as mentioned in Fig. 1.
After training, the performance of the proposed classification framework (MDF-RN+STB-CN) was evaluated on an independent testing dataset. In the case of 2-D images, a trained MDF-RN model exploits the spatial features and performs class prediction. In the case of 3-D imaging data such as endoscopy videos, CT, and MRI scans, the second trained STB-CN further improves the overall performance by exploiting 3-D anatomical dependencies and results in an additional performance gain. Initially, MDF-RN sequentially transforms the sequence of n successive slices/frames (i.e., $F_1, F_2, F_3, \ldots, F_n$) into n feature vectors (i.e., $f_1, f_2, f_3, \ldots, f_n$). Then, the second-stage STB-CN model processes these feature vectors in parallel to exploit additional 3-D anatomical features and perform class prediction. To provide visual insight into the network decision, we also visualized the class activation map for each input 2-D or 3-D imaging data sample as an additional output (see Fig. 1).

B. MDF-RN Model Structure and Workflow
To achieve superior classification performance and fast execution speed, the proposed MDF-RN design utilizes the following strengths.
1) Residual blocks [labeled skip residual (SR)-block and projected residual (PR)-block in Fig. 2] of ResNet (RN) models [24].
2) Our newly included DF-block (as shown in Fig. 2) based on multiscale dilated convolution layers.
3) A concept of multilevel feature fusion in a mutually beneficial manner.
The complete structure of our MDF-RN model includes five SR-blocks, three PR-blocks, three DF-blocks, and some other layers, as shown in Fig. 2.
1) Residual Blocks: In general, residual blocks (SR- and PR-blocks) avoid the vanishing gradient problem in a training process and achieve the optimal convergence of the entire network. Therefore, we selected the residual blocks in our network design to exploit high-level semantic features. Both residual blocks consist of two 3×3 convolutional layers and a residual connection, as shown in the bottom-left corner of Fig. 2. The SR-block includes a SR connection and transforms the input tensor $F_k \in \mathbb{R}^{w_k \times h_k \times d_k}$ into the final output tensor $F_l \in \mathbb{R}^{w_k \times h_k \times d_k}$ without influencing the dimension. By contrast, the PR-block consists of a PR connection based on a 1×1 convolutional layer, which maps the input tensor before it is added to the output of the two convolutional layers. Mathematically, the input tensor $F_k$ undergoes the following transformations after passing through these residual blocks:
$$\Psi_{SR}(F_k) = h_{\phi_l}\big(h_{\phi_k}(F_k)\big) + F_k$$
$$\Psi_{PR}(F_k) = h_{\phi_l}\big(h_{\phi_k}(F_k)\big) + h_{\phi_m}(F_k)$$
where $\Psi_{SR}(\cdot)$ and $\Psi_{PR}(\cdot)$ denote the SR- and PR-blocks as transfer functions, respectively. $h_{\phi_k}(\cdot)$, $h_{\phi_l}(\cdot)$, and $h_{\phi_m}(\cdot)$ represent the convolutional layers with training parameters $\phi_k$, $\phi_l$, and $\phi_m$, respectively.
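The difference between the two residual connections can be illustrated with a small sketch. It assumes the additive residual form of standard ResNet blocks [24] and replaces each convolutional layer with a toy matrix multiplication; the helper names (`linear`, `sr_block`, `pr_block`) and all weights are hypothetical and are not taken from the released implementation.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum; a residual addition needs matching shapes."""
    if len(a) != len(b):
        raise ValueError("residual addition needs matching shapes")
    return [x + y for x, y in zip(a, b)]


def linear(weights: List[List[float]]) -> Layer:
    """Toy layer computing W @ x, standing in for a convolutional layer."""
    def layer(x: Vector) -> Vector:
        return [sum(w * v for w, v in zip(row, x)) for row in weights]
    return layer


def sr_block(x: Vector, h_k: Layer, h_l: Layer) -> Vector:
    """Skip residual: main-path output plus the unchanged input."""
    return add(h_l(h_k(x)), x)


def pr_block(x: Vector, h_k: Layer, h_l: Layer, h_m: Layer) -> Vector:
    """Projected residual: main-path output plus a projected input."""
    return add(h_l(h_k(x)), h_m(x))


# Hypothetical toy weights.
identity2 = linear([[1.0, 0.0], [0.0, 1.0]])
double2 = linear([[2.0, 0.0], [0.0, 2.0]])
expand_2_to_3 = linear([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
identity3 = linear([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
project_2_to_3 = linear([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

x = [1.0, 2.0]
sr_out = sr_block(x, identity2, double2)                         # same size as x
pr_out = pr_block(x, expand_2_to_3, identity3, project_2_to_3)   # size changes

print(len(x), len(sr_out), len(pr_out))
```

In this toy setting the skip connection only works when the main path preserves the input size, whereas the projection lets the block change the size of its output, which is the role the 1×1 convolution plays in the PR-block.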
2) DF Block: Additionally, we proposed a DF-block (as shown in the bottom-left corner of Fig. 2) followed by an average pooling layer to capture a multiscale representation of multilevel (i.e., low-, intermediate-, and high-level) features acquired from different residual blocks (see Fig. 2). The key intuition behind the development of DF-block is to aggregate the multiscale representation of deep features at different resolutions. Quantitative results (in the results section) show the significant strength of our designed DF-block. The structure of our proposed DF-block includes a total of three parallelly connected dilated convolutional layers (with a filter size of 3×3 and dilation rates of 6, 12, and 18) and a PR-connection based on a 1×1 convolutional layer. The DF-block transforms the input tensor $F_k \in \mathbb{R}^{w_k \times h_k \times d_k}$ into the output tensor $F_l \in \mathbb{R}^{w_k \times h_k \times 2d_k}$ by exploiting additional multiscale features from the output of different residual blocks. Mathematically, the input tensor $F_k \in \mathbb{R}^{w_k \times h_k \times d_k}$ passes through the DF-block as a transfer function $\Psi_{DF}(\cdot)$, in which $h_{\phi_k}(\cdot)$ and $h_{\phi_k^x}(\cdot)$ represent simple and dilated convolutional layers with training parameters $\phi_k$ and $\phi_k^x$, respectively, and the layer outputs are combined by depth-wise feature concatenation.
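To make the multiscale effect of the three dilation rates concrete, the following sketch computes the effective kernel span of a 3×3 kernel at rates 6, 12, and 18, together with the zero padding needed to keep the spatial size unchanged. It assumes a stride of 1 and symmetric padding (the text states only that the spatial size is preserved); the function names are illustrative.

```python
def effective_kernel(kernel: int, dilation: int) -> int:
    """Span in pixels covered by a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


def same_padding(kernel: int, dilation: int) -> int:
    """Per-side zero padding that preserves the size at stride 1 (odd kernels)."""
    return (effective_kernel(kernel, dilation) - 1) // 2


def conv_output_size(size: int, kernel: int, dilation: int,
                     padding: int, stride: int = 1) -> int:
    """Standard convolution output-size formula along one axis."""
    return (size + 2 * padding - effective_kernel(kernel, dilation)) // stride + 1


# Map each dilation rate to (effective kernel, padding, output size for a 56-wide input).
summary = {}
for rate in (6, 12, 18):
    pad = same_padding(3, rate)
    summary[rate] = (effective_kernel(3, rate), pad, conv_output_size(56, 3, rate, pad))

for rate, (k_eff, pad, out) in summary.items():
    print(f"dilation {rate}: effective kernel {k_eff}x{k_eff}, padding {pad}, 56 -> {out}")
```

Under these assumptions the three branches see spans of 13, 25, and 37 pixels with the same nine weights per filter, which is what allows a single block to aggregate context at several scales.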
3) Multilevel Feature Fusion: A concept of multilevel feature fusion is introduced in our MDF-RN model by aggregating the joint contribution of the multiscale low-, intermediate-, and high-level semantic features (i.e., $f_1$-$f_4$) in the final classification decision. These multilevel features are obtained from different residual blocks using multiple DF-blocks (see Fig. 2) and provide a diverse representation of a particular class. A detailed ablation study (in a later section) shows the substantial contribution of multilevel feature fusion in achieving state-of-the-art performance.
4) Model Workflow: Initially, a 7×7 convolutional layer explores the input image F and generates an output tensor of size 112×112×64, which is further processed by a 3×3 max-pooling layer and downsampled into a new output tensor of size 56×56×64. Consequently, a stack of nine building blocks (including five SR-blocks, three PR-blocks, and one DF-block, as shown in Fig. 2) sequentially processes the output of the preceding layer/block and finally generates a multiscale high-level feature vector of size 1×1×1024 (labeled as $f_4$ in Fig. 2).
Additionally, three DF-blocks were included to exploit multiscale low- and intermediate-level semantic features (i.e., $f_1$-$f_3$) from three different residual blocks. These residual blocks are selected based on the different spatial sizes of their output tensors (i.e., 56×56, 28×28, and 14×14) to obtain low- and intermediate-level semantic features. Moreover, each DF-block is followed by an average pooling layer that further transforms the 2-D output tensor of the DF-block into a 1-D vector space. A depth concatenation layer followed by the first FC layer (FC1; Fig. 2) fused all multilevel semantic features (i.e., $f_1$-$f_4$) and further exploited more discriminative patterns. Consequently, we obtained a multilevel semantic representation of input image F as an output feature vector f of size 1×1×256. In the case of a 2-D image, the MDF-RN model further performs the class prediction by processing the output feature vector f with a stack of three additional layers (FC2, SoftMax, and classification layers; Fig. 2).
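The spatial sizes quoted in the workflow (224 to 112 to 56, then 28 and 14) follow from the usual convolution output-size formula. The sketch below reproduces them under assumed strides and paddings taken from the standard ResNet design (stride 2 with padding 3 for the 7×7 layer, stride 2 with padding 1 for the pooling layer and for each downsampling stage); these hyperparameters are assumptions, since the text reports only the resulting tensor sizes.

```python
def output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


INPUT_SIZE = 224  # fixed input size of the MDF-RN model

# Assumed hyperparameters (standard ResNet stem), not stated in the text.
after_conv = output_size(INPUT_SIZE, kernel=7, stride=2, padding=3)
after_pool = output_size(after_conv, kernel=3, stride=2, padding=1)

# Spatial sizes of the three stages that feed the low/intermediate DF-blocks,
# assuming each later stage halves the resolution with a stride-2 3x3 layer.
stage_sizes = [after_pool]
for _ in range(2):
    stage_sizes.append(output_size(stage_sizes[-1], kernel=3, stride=2, padding=1))

print(after_conv, after_pool, stage_sizes)
```

The computed values agree with the 112×112 and 56×56 tensors after the first two layers and with the 56×56, 28×28, and 14×14 outputs that the three additional DF-blocks tap.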

C. STB-CN Model Structure and Workflow
In the case of 3-D imaging data consisting of n successive slices/frames (i.e., $F_1, F_2, F_3, \ldots, F_n$), the proposed MDF-RN model sequentially processes each input slice/frame and generates a set of n feature vectors (i.e., $f_1, f_2, f_3, \ldots, f_n$) of size 1×1×256×n. All these feature vectors are extracted from the FC1 layer of our MDF-RN model. These feature vectors are further processed by the second-stage STB-CN model to exploit additional 3-D anatomical features and perform class prediction. The STB-CN includes a revised variant of RNNs called the LSTM model [19], [20], [21], which resolves the vanishing gradient problem in the training process and can leverage transfer learning in the case of volumetric data analysis without influencing the overall training parameters. Therefore, we utilized the strength of LSTM in designing our second-stage STB-CN model for the effective classification of volumetric data in the medical domain.
The overall structure and workflow of the proposed STB-CN are shown in Fig. 2. First, a sequence input layer passes a set of n feature vectors (i.e., $f_1, f_2, f_3, \ldots, f_n$) to the LSTM layer, which exploits additional 3-D anatomical dependencies among these feature vectors after processing through a sequence of n LSTM cells (see Fig. 2) and finally generates a single feature vector $h_n$ of size 1×1×1200 (obtained from the last LSTM cell). The output feature vector $h_n$ incorporates both 2-D spatial and 3-D anatomical information of the 3-D imaging data (i.e., $F_1, F_2, F_3, \ldots, F_n$) and is further refined by a third FC layer (FC3; Fig. 2) to exploit more discriminative patterns. Finally, a stack of three additional layers (FC4, SoftMax, and classification layers; Fig. 2) predicts a single class label for the entire 3-D imaging data sample based on the highest probability score (similar to MDF-RN) using the final output feature vector $h_n$.
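The property that matters here, namely that one LSTM layer maps a sequence of any length n to a single fixed-size vector $h_n$, can be seen in a minimal pure-Python sketch. The class `TinyLSTM`, its random weights, and the small sizes (4 input features, 3 hidden units instead of 256 and 1200) are hypothetical and only illustrate the standard LSTM recurrence; this is not the authors' MATLAB implementation.

```python
import math
import random
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Single LSTM layer that returns the hidden state of the last cell."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        cols = input_size + hidden_size + 1  # weights for [x_t, h_{t-1}, bias]
        # One weight matrix per gate: input (i), forget (f), output (o), candidate (g).
        self.w = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                   for _ in range(hidden_size)]
            for gate in "ifog"
        }

    def _affine(self, gate: str, z: List[float]) -> List[float]:
        return [sum(w * v for w, v in zip(row, z)) for row in self.w[gate]]

    def last_hidden(self, sequence: List[List[float]]) -> List[float]:
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in sequence:  # one LSTM cell per slice/frame feature vector
            z = list(x) + h + [1.0]
            i = [sigmoid(v) for v in self._affine("i", z)]
            f = [sigmoid(v) for v in self._affine("f", z)]
            o = [sigmoid(v) for v in self._affine("o", z)]
            g = [math.tanh(v) for v in self._affine("g", z)]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        return h


model = TinyLSTM(input_size=4, hidden_size=3)
short_scan = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]   # n = 3 slices
long_scan = [[0.1, 0.2, 0.3, 0.4] for _ in range(7)]    # n = 7 slices
h_short = model.last_hidden(short_scan)
h_long = model.last_hidden(long_scan)
print(len(h_short), len(h_long))
```

Both sequences yield a vector of the same length, so the layers that follow (FC3 and FC4 in the paper) can stay fixed regardless of how many slices a scan contains.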

D. Training Loss
A two-step training process of both the MDF-RN and STB-CN models was performed sequentially to attain optimal convergence of our proposed classification framework. In the first step, the MDF-RN was trained to exploit and learn the spatial features from the entire training dataset, denoted as $[F_T]_{i=1}^{p}, [l_T]_{i=1}^{p}$, using a cross-entropy (CE) loss function [9]. The initial weights of different residual blocks in MDF-RN were obtained from a pretrained RN [24] that was trained with a large-scale ImageNet dataset using the CE loss function. Therefore, a similar loss function was used to train our MDF-RN model. In the next step, the training and validation datasets were converted into feature vectors after processing each data sample through MDF-RN. Subsequently, the second STB-CN model was trained to learn the 3-D anatomical dependencies in the case of 3-D imaging data using the same CE loss function. In the overall two-step loss function of the proposed models, $\psi_1$ and $\psi_2$ represent the MDF-RN and STB-CN models as transfer functions, respectively, and $\mathcal{L}_1(\cdot)$ and $\mathcal{L}_2(\cdot)$ are the CE loss functions.
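As a reminder of what the CE loss measures in each training step, the sketch below computes softmax probabilities and the cross-entropy of a single sample from raw class scores. It is a generic textbook formulation, not the training code of either network; the function names and the example scores are illustrative.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over raw class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores: List[float], label: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(scores)[label])


def mean_cross_entropy(batch_scores: List[List[float]], labels: List[int]) -> float:
    """Average CE over a mini-batch."""
    losses = [cross_entropy(s, y) for s, y in zip(batch_scores, labels)]
    return sum(losses) / len(losses)


NUM_CLASSES = 70  # number of classes in the fused database

# An untrained model that scores every class equally has a loss of ln(70).
chance_loss = cross_entropy([0.0] * NUM_CLASSES, label=0)

confident_right = cross_entropy([5.0, 0.0, 0.0], label=0)
confident_wrong = cross_entropy([5.0, 0.0, 0.0], label=1)
batch_loss = mean_cross_entropy([[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]], [0, 1])

print(round(chance_loss, 3), round(confident_right, 3), round(confident_wrong, 3))
```

The chance-level value gives a useful reference point when reading loss curves for a 70-class problem: a loss well below ln(70) indicates that the model has learned something beyond uniform guessing.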

A. Dataset and Experimental Setup
To perform a quantitative analysis of our proposed classification framework, twenty-six different datasets as used in [9], [32], [33], [34], [35], [36], [37], [38], [39], [40], and [41] were fused to build a single large-scale database that included a total of 151,095 data samples. The whole dataset was then divided into 70 different classes according to the given ground-truth labels, which included various types of normal and disease categories. In this study, we aimed to cover a wide variety of publicly available 2-D/3-D imaging datasets related to the medical diagnostic domain. Therefore, we explored numerous publicly available datasets and eventually selected well-known data repositories from large publicly available collections based on their publication venue. To provide a visual representation of our final dataset, Fig. 3 shows a few example images of each class. In addition, Table I shows the details of each class in our selected datasets in notation L(X/Y/Z) that provides the following information: 1) actual ground-truth label (L), 2) type of imaging modality (X), 3) whether it includes 2-D or 3-D imaging data (Y), and 4) the total number of data samples in terms of the number of images/slices/frames (Z). Most of the 3-D imaging data are related to CT and MRI imaging modalities that do not include time information. However, their length information (as a number of slices) is included in Table I. In detail, the number of slices for each class related to 3-D imaging data was determined by counting the total number of all the slices in each 3-D scan of a particular class. In addition, such meta-information (i.e., number of slices) was also provided in each dataset related to CT and MRI imaging modalities. A few classes (i.e., D8, D9, D10, E1, and E2) comprise endoscopy data encoded at a frame rate of 25 frames per second and having a variable length in terms of the number of frames, as mentioned in Table I. In the data preprocessing step, all the data samples were resized to a fixed spatial dimension of 224×224 (as the fixed input layer size of our proposed MDF-RN model). Additionally, online data augmentation was performed to resolve the class imbalance problem during the training process.

TABLE I BRIEF DESCRIPTION OF EACH CLASS (…, "MRI: MAGNETIC RESONANCE IMAGING," "XR: X-RAYS," "VLC: VISIBLE LIGHT CAMERA," "PET: POSITRON-EMISSION TOMOGRAPHY," "US: ULTRASOUND," "ES: ENDOSCOPY," "FC: FUNDUS CAMERA," "CNV: CHOROIDAL NEOVASCULARIZATION," "OCT: OPTICAL COHERENCE TOMOGRAPHY," "DME: DIABETIC MACULAR EDEMA," "GI: GASTROINTESTINAL," "MSI: MICROSATELLITE INSTABILITY," "MSS: MICROSATELLITE STABILITY"). NOTE: THE NOTATION "A1, A2, …, G10" PRESENTS "CLASS 1, CLASS 2, …, CLASS 70"
The MATLAB (R2019a) coding framework (including the DL toolbox) was used for model development and simulation using a desktop computer with an Intel Core i7 CPU, 16 GB RAM, NVIDIA GeForce graphics processing unit (GPU) (GTX 1070), and Windows 10 operating system. In our optimization scheme, a stochastic gradient descent optimizer [42] with a learning rate of 0.001 was used for training both networks. Various existing studies [43], [44], [45], [46] related to medical image analysis considered such a small learning rate value of 0.001 for the optimal training of their proposed models. Generally, with a small learning rate, a minimum can be reached eventually; however, it requires many epochs to get there [47]. Conversely, when the learning rate is relatively large, the training loss drops sharply at first, fluctuates above the minimum, and never decays to the minimum [47]. Therefore, we chose a small value of learning rate (as reported in various existing studies [43], [44], [45], [46]) to achieve optimal convergence of the proposed model. We selected mini-batch sizes of 10 and 100 for training the MDF-RN and STB-CN models, respectively. These optimal values for mini-batch sizes were experimentally determined based on the maximum convergence of training accuracies, as shown in Fig. 4.
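The trade-off between small and large learning rates described above can be reproduced on a toy problem. The sketch below runs plain (deterministic) gradient descent on the hypothetical objective f(x) = x², whose gradient is 2x; it is not stochastic, uses no mini-batches, and is unrelated to the actual network losses, so it illustrates only the qualitative behavior.

```python
from typing import List


def gradient_descent(learning_rate: float, steps: int, start: float = 1.0) -> List[float]:
    """Minimise f(x) = x**2 and return the visited points."""
    x = start
    path = [x]
    for _ in range(steps):
        x = x - learning_rate * 2.0 * x  # gradient of x**2 is 2x
        path.append(x)
    return path


small_lr_path = gradient_descent(learning_rate=0.001, steps=1000)
large_lr_path = gradient_descent(learning_rate=1.0, steps=10)

print("lr=0.001 after 1000 steps:", round(small_lr_path[-1], 4))
print("lr=1.0 path:", large_lr_path[:4], "...")
```

With the small rate the iterate shrinks monotonically toward the minimum at 0 but is still well above it after 1000 steps, while with the large rate it jumps between +1 and -1 and never gets closer, mirroring the two regimes attributed to [47].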
In addition, because of the memory size limitation of the GPU, it was not possible to select higher values (i.e., >10 and >100) for the mini-batch sizes. For the other hyperparameters, we used the default parametric scheme provided by MATLAB (R2019a). In all experiments, two-fold cross-validation was performed using 40% (60432 data samples), 10% (15108 data samples), and 50% (75554 data samples) of the whole dataset for training, validation, and testing, respectively. Two-fold cross-validation includes a smaller number of training data and shows lower accuracy compared to ten-fold cross-validation as reported in [48]. In addition, various existing studies related to medical image analysis [49], [50], [51], [52] also considered two-fold cross-validation to validate their proposed methods. Therefore, we considered two-fold cross-validation in all the experiments to show that high accuracy can be achieved even with a smaller number of training data. In most classes, different patient datasets were chosen for training, validation, and testing. Fig. 4 shows the training/validation losses and accuracies of both networks as the number of epochs increases. The convergence of the training curves indicates that both networks were sufficiently trained with the training data. In addition, the validation curves further confirm that our models were not overfitted to the training data. Numerous medical image classification studies measure the effectiveness of their proposed model with the following five performance evaluation metrics: 1) average accuracy (ACC), 2) F1-score (F1), 3) precision (PRE), 4) recall (REC), and 5) area under the curve (AUC). Therefore, we measured the effectiveness of the proposed model compared to various baseline methods using these five performance evaluation metrics as key indicators.
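For reference, four of the five metrics can be computed directly from predicted and true class labels, as in the sketch below. It assumes macro-averaging (an unweighted mean over classes) for PRE, REC, and F1, which the text does not specify; the class names and the six toy predictions are made up for illustration, and AUC is omitted because it requires class scores rather than hard labels.

```python
from collections import Counter
from typing import Dict, List


def macro_metrics(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in classes:
        pre = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
        precisions.append(pre)
        recalls.append(rec)
        f1s.append(f1)

    n = len(classes)
    return {
        "ACC": sum(tp.values()) / len(y_true),
        "PRE": sum(precisions) / n,
        "REC": sum(recalls) / n,
        "F1": sum(f1s) / n,
    }


# Hypothetical labels for three classes.
y_true = ["A1", "A1", "B1", "B1", "C1", "C1"]
y_pred = ["A1", "B1", "B1", "B1", "C1", "A1"]
scores = macro_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in scores.items()})
```

Macro-averaging gives every class the same weight regardless of its sample count, which is one common choice when classes are imbalanced; a sample-weighted average would be an equally valid reading of "average" here.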

B. Testing Results (Ablation Studies)
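We proposed a cascade of two networks for the classification of both 2-D and 3-D imaging data related to the medical domain. Table II presents the quantitative results of our proposed MDF-RN+STB-CN, along with the performance of MDF-RN (our second-best proposed model) and RN (baseline model) as an ablation study. The results in Table II primarily highlight the contribution of multilevel feature fusion using the proposed DF-blocks (MDF-RN versus RN) and of the second-stage STB-CN model (MDF-RN+STB-CN versus MDF-RN) in terms of performance gains. The procedure of this comparative analysis (see Table II) is defined as follows.
1) In our first comparison (MDF-RN versus RN), we disregarded the 3-D anatomical dependencies of 3-D imaging data by considering the whole data as 2-D imaging data. For the data conversion from 3-D volume to 2-D slices/images, the class label of each 3-D data sample was assigned to its corresponding slices. Consequently, all the 3-D data samples were converted into 2-D imaging data.
2) In our second comparison (MDF-RN+STB-CN versus MDF-RN), we further highlighted the contribution of 3-D anatomical dependencies of 3-D imaging data by introducing our second-stage STB-CN for additional feature extraction in the case of 3-D volumetric data samples (as explained in Section III-C).
The first proposed MDF-RN model (comprising SR-, PR-, and DF-blocks) outperforms the RN model (comprising only SR- and PR-blocks) with average gains of 2.54%, 1.71%, 1.23%, 2.16%, and 1.75% in terms of ACC, F1, PRE, REC, and AUC, respectively. Subsequently, the addition of the STB-CN model further improved the performance of the MDF-RN model, with average gains of 0.73%, 2.18%, 2.31%, 2.05%, and 1.36% in terms of ACC, F1, PRE, REC, and AUC, respectively. Ultimately, our proposed MDF-RN+STB-CN model significantly outperformed the RN (baseline model), with average gains of 3.27%, 3.89%, 3.54%, 4.21%, and 3.11% in terms of ACC, F1, PRE, REC, and AUC, respectively. In a t-test analysis, our first proposed MDF-RN achieved an average p-value of 0.001 (p < 0.01), and the final MDF-RN+STB-CN attained a p-value of 0.00003 (p < 0.01) compared to RN (baseline model). These low p-values (p < 0.01) imply that both networks significantly outperformed the baseline model at a 99% confidence level.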
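As a quick consistency check on the gains quoted in this subsection, the sketch below re-adds the two stage-wise gains (MDF-RN over RN, then STB-CN over MDF-RN) and compares the sums with the reported total gains of MDF-RN+STB-CN over RN. It uses only the numbers stated in the text; the variable names are illustrative.

```python
metric_names = ["ACC", "F1", "PRE", "REC", "AUC"]
gain_mdf_over_rn = [2.54, 1.71, 1.23, 2.16, 1.75]       # MDF-RN versus RN
gain_stb_over_mdf = [0.73, 2.18, 2.31, 2.05, 1.36]      # MDF-RN+STB-CN versus MDF-RN
gain_total_reported = [3.27, 3.89, 3.54, 4.21, 3.11]    # MDF-RN+STB-CN versus RN

# Sum the stage-wise gains and compare against the reported overall gains.
gain_total_computed = [round(a + b, 2) for a, b in zip(gain_mdf_over_rn, gain_stb_over_mdf)]

for name, computed, reported in zip(metric_names, gain_total_computed, gain_total_reported):
    status = "consistent" if abs(computed - reported) < 1e-9 else "mismatch"
    print(f"{name}: {computed:.2f} (computed) vs {reported:.2f} (reported): {status}")
```

All five sums match the reported totals, so the stage-wise gains are additive in percentage points.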
Moreover, Fig. 5 further highlights the performance gain of our proposed MDF-RN+STB-CN (best model) compared to MDF-RN (proposed second-best model) and RN (baseline model) as receiver operating characteristic (ROC) curves. In detail, for the image-based models (i.e., MDF-RN and baseline RN), we evaluated the slice/frame-level classification performance on 3-D imaging data by considering only spatial features, whereas for our sequence-based model (i.e., MDF-RN+STB-CN), we evaluated the block-level classification performance on 3-D imaging data by exploiting both spatial and 3-D anatomical features. Each ROC curve (see Fig. 5) indicates the tradeoff between the true positive rate (TPR) and false positive rate (FPR) at different classification thresholds, ranging from 0 to 1 in 0.01 increments. Each model attains its optimal validation performance at a particular classification threshold (labeled as operating points in Fig. 5). The values of the optimal operating points for MDF-RN+STB-CN (our proposed best model), MDF-RN (our proposed second-best model), and RN (baseline model) were 0.41, 0.44, and 0.46, respectively.
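The threshold sweep behind an ROC curve can be sketched as follows for a single binary (one-versus-rest) decision. This is an illustrative example rather than the evaluation code of this study: the scores and labels are made up, and the operating point is chosen here by maximizing TPR minus FPR (Youden's J statistic), which is an assumed criterion since the text does not state how the operating points were selected.

```python
def roc_points(scores, labels, step=0.01):
    """Sweep thresholds from 0 to 1 and return (threshold, FPR, TPR) triples."""
    positives = sum(labels)
    negatives = len(labels) - positives
    n_steps = int(round(1 / step))
    points = []
    for i in range(n_steps + 1):
        thr = i / n_steps
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, fp / negatives, tp / positives))
    return points

def operating_point(points):
    """Return the point maximizing TPR - FPR (assumed selection criterion)."""
    return max(points, key=lambda p: p[2] - p[1])

# Hypothetical prediction scores and ground-truth labels (1 = positive class).
scores = [0.95, 0.80, 0.62, 0.45, 0.40, 0.30, 0.20, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

points = roc_points(scores, labels)
thr, fpr, tpr = operating_point(points)
print(f"operating threshold = {thr:.2f}, FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```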
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

TABLE III PROGRESSIVE PERFORMANCE GAINS OF THE PROPOSED MDF-RN AND MDF-RN+STB-CN MODELS BASED ON MULTILEVEL FEATURE FUSION
Compared with RN (baseline model), our best model significantly reduced the FPR from 14.14% to 11.09%, an average reduction of 3.05% [14.14% - 11.09%], and increased the TPR from 82.57% to 86.78%, an average gain of 4.21% [86.78% - 82.57%]. Similarly, our second-best model significantly reduced the FPR from 14.14% to 11.52%, an average reduction of 2.62% [14.14% - 11.52%], and increased the TPR from 82.57% to 84.73%, an average gain of 2.16% [84.73% - 82.57%], in comparison with RN. Consequently, our proposed MDF-RN+STB-CN correctly classified a total of 3181 more data samples than RN (baseline model).
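The figures in this paragraph can be reproduced with a few lines of arithmetic. The sketch below assumes that the count of additionally classified samples follows from applying the TPR gain to the 75554 test samples reported in the experimental setup; this reading is an inference from the numbers, not a statement made explicitly in the text.

```python
fpr_rn, fpr_best = 14.14, 11.09    # FPR (%) of RN and MDF-RN+STB-CN
tpr_rn, tpr_best = 82.57, 86.78    # TPR (%) of RN and MDF-RN+STB-CN
test_samples = 75554               # size of the 50% test split

fpr_reduction = round(fpr_rn - fpr_best, 2)   # percentage points
tpr_gain = round(tpr_best - tpr_rn, 2)        # percentage points

# Assumed interpretation: extra correct samples = TPR gain applied to the test set.
extra_correct = round(test_samples * tpr_gain / 100)

print(f"FPR reduction: {fpr_reduction:.2f} points")
print(f"TPR gain: {tpr_gain:.2f} points")
print(f"Additional correctly classified samples: {extra_correct}")
```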
A detailed ablation study was further conducted to demonstrate the significance of multilevel feature fusion in the proposed image-based model (MDF-RN). Successive feature-level performance was calculated to highlight the contribution of multiscale low-, intermediate-, and high-level semantic features (i.e., f1-f4). Subsequently, the same ablation study was conducted for our proposed sequence-based model (MDF-RN+STB-CN). Table III lists the successive feature-level performances of both networks. It can be observed (see Table III) that the fusion of multilevel features (i.e., f1-f4) results in a progressive gain, finally yielding a high-performance MDF-RN model based on multilevel feature fusion. Similarly, the proposed MDF-RN+STB-CN also showed progressive gains with the fusion of multilevel features in the MDF-RN model.
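To make the idea of multilevel feature fusion concrete, the sketch below pools four feature maps of different depths and resolutions into one vector by global average pooling followed by concatenation. This is a generic fusion scheme used here as a stand-in: the channel counts, spatial sizes, and helper names are hypothetical, and the dilated-convolution DF-blocks of the proposed network are not reproduced.

```python
def global_average_pool(feature_map):
    """Reduce a feature map (channels x height x width nested lists) to one value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]

def fuse_levels(feature_maps):
    """Concatenate the pooled descriptors of all levels into a single feature vector."""
    fused = []
    for fm in feature_maps:
        fused.extend(global_average_pool(fm))
    return fused

def constant_map(channels, size, value):
    """Build a placeholder feature map filled with a constant (illustration only)."""
    return [[[value] * size for _ in range(size)] for _ in range(channels)]

# Four hypothetical levels: shallow levels have few channels and high resolution,
# deep levels have many channels and low resolution.
levels = [
    constant_map(2, 8, 1.0),    # stands in for f1 (low-level)
    constant_map(4, 4, 2.0),    # stands in for f2
    constant_map(8, 2, 3.0),    # stands in for f3
    constant_map(16, 1, 4.0),   # stands in for f4 (high-level)
]
fused = fuse_levels(levels)
print(len(fused))
```

The fused vector has one entry per channel across all levels, so a classifier placed on top of it sees low-, intermediate-, and high-level information at once.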
The initial weights of the different residual blocks in our MDF-RN model were obtained from a pretrained RN [24] through a transfer learning approach. Therefore, we also trained our models from scratch to highlight the importance of transfer learning in terms of quantitative performance. For the MDF-RN model, the results indicate that transfer learning outperforms training from scratch with average gains of 20.11%, 18.29%, 16.77%, 19.60%, and 23.02% in terms of ACC, F1, PRE, REC, and AUC, respectively. Similarly, we observed significant performance gains of 18.79%, 16.45%, 15.51%, 17.26%, and 18.86% in terms of ACC, F1, PRE, REC, and AUC, respectively, for MDF-RN+STB-CN. Even with the large number of data samples in our training dataset, a significant impact of transfer learning can be observed (see Table IV) in developing our image-based and sequence-based classification models.
In addition, Fig. 6 presents the clustering of the classification results of the final proposed model in terms of the confusion matrix. These results (see Fig. 6) show the individual performance of each class by visualizing the number of false predictions as type I (false-positive) or type II (false-negative) errors. It can be observed that the data samples of classes A1 to C10 show a higher number of false predictions (highlighted with a red box in Fig. 6) than the other classes. This performance degradation is caused by high interclass similarities, which arise for the following reasons: 1) overlapping body organs in different classes and 2) data samples of different imaging modalities (i.e., CT and MRI) presenting similar types of diseases in different classes. However, the overall performance of our proposed method is significantly better than that of the other baseline methods (as shown in Section IV-C).

C. Comparison
This section provides a detailed comparison of our proposed MDF-RN+STB-CN (best model) and MDF-RN (second-best model) with several state-of-the-art image- and sequence-based CAD methods. This is the first study related to the classification of large-scale 2-D/3-D imaging data, and no standard benchmarks are available in the literature for the selected dataset. Therefore, we explored the existing literature related to medical image classification and selected fifteen different methods [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21] closely related to our work for this comparison. In detail, all these competing methods utilize transfer learning with existing pretrained CNN models [22], [23], [24], [25], [26], [27], [28], [29], [30], [31] in developing their CAD solutions. These studies cover a broad scope of 1) disease-specific, 2) modality-specific, 3) multimodality-based, and 4) multi-disease-based CAD solutions. In addition, the source codes of all these methods are publicly available for a fair comparison. Therefore, we selected these methods for comparative analysis with our proposed solution. To make a fair comparison and provide standard benchmarks, we evaluated the performance of these existing methods on our selected dataset. Table V presents the comparative results of our proposed models and the 15 state-of-the-art methods.
The procedure of this comparative study is defined as follows. Initially, we compared the performance of our first image-based model with various image-based classification methods [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18] using our selected dataset. All these comparative models are labeled as image-based models in Table V. In this comparison, we ignored the 3-D anatomical information of 3-D imaging data by considering the whole data as 2-D imaging data (as explained in Section IV-B). Subsequently, we compared the performance of our final sequence-based model with three different sequence-based classification methods [19], [20], [21] under the same experimental setting. These comparative models are labeled as sequence-based models in Table V. In this comparison, we also considered the contribution of 3-D anatomical dependencies of 3-D imaging data (as explained in Section III-C).
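The conversion of 3-D samples into 2-D data for the image-based comparison amounts to flattening each volume into its slices, with every slice inheriting the class label of its volume. A minimal sketch follows; the slice identifiers are placeholders invented for this example, and real slices would be image arrays.

```python
def volume_to_slices(volumes):
    """Flatten labeled 3-D samples into (slice, label) pairs.

    Each volume is given as (list_of_slices, class_label); every slice
    inherits the class label of the volume it came from.
    """
    pairs = []
    for slices, label in volumes:
        for s in slices:
            pairs.append((s, label))
    return pairs

# Two hypothetical volumes with 3 and 2 slices; B2 and E10 are class labels from the dataset.
volumes = [
    (["vol_a_slice_0", "vol_a_slice_1", "vol_a_slice_2"], "B2"),
    (["vol_b_slice_0", "vol_b_slice_1"], "E10"),
]
pairs = volume_to_slices(volumes)
for item, label in pairs:
    print(item, label)
```

The image-based models are then trained and evaluated on these pairs at the slice/frame level, without any information about slice order.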
Compared to various image-based classification methods, our proposed 2-D-CNN model (MDF-RN) shows better results. Jasil et al. [16] proposed a CAD method based on DenseNet201 [16], which showed comparable results and ranked as the second-best method among the image-based models. However, our proposed MDF-RN model outperformed [16] in terms of both quantitative and computational performance. In detail, our MDF-RN model outperformed DenseNet201 (used by Jasil et al. [16]) with average gains of 0.64%, 0.75%, 0.85%, 0.66%, and 0.39% in terms of ACC, F1, PRE, REC, and AUC, respectively. In a t-test analysis, our MDF-RN model outperformed DenseNet201 at a 99% confidence level, reaching an average p-value of 0.001 (p < 0.01). In addition, the average inference time (class prediction time for one data sample) of our MDF-RN model was approximately 50% lower than that of Jasil et al. [16]. Specifically, our MDF-RN took approximately 13.26 ms, whereas DenseNet201 (used by Jasil et al. [16]) required approximately 25.88 ms for one image. The average inference time was evaluated using the same experimental setup described in Section IV. Moreover, our final MDF-RN+STB-CN model gave better results than MDF-RN and further outperformed the second-best image-based method (Jasil et al. [16]) with average gains of 1.37%, 2.93%, 3.16%, 2.71%, and 1.75% in terms of ACC, F1, PRE, REC, and AUC, respectively. A t-test analysis also showed the superior performance of our final MDF-RN+STB-CN model compared to [16] at a 99% confidence level, reaching an average p-value of 0.001 (p < 0.01).
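The "approximately 50% lower" inference time can be checked directly from the two per-image timings quoted above. The short calculation below uses only those two numbers; the derived throughput figure is an illustration and was not reported in the text.

```python
t_mdf_rn_ms = 13.26        # average inference time of MDF-RN per image (ms)
t_densenet201_ms = 25.88   # average inference time of DenseNet201 per image (ms)

# Relative reduction in inference time and the equivalent speedup factor.
relative_reduction = (t_densenet201_ms - t_mdf_rn_ms) / t_densenet201_ms * 100
speedup = t_densenet201_ms / t_mdf_rn_ms

# Derived quantity (not reported in the text): images processed per second.
images_per_second = 1000.0 / t_mdf_rn_ms

print(f"Relative reduction: {relative_reduction:.2f}%")
print(f"Speedup factor: {speedup:.2f}x")
print(f"MDF-RN throughput: {images_per_second:.1f} images/s")
```

The exact reduction is slightly under one half, which is consistent with the rounded statement in the text.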
In the context of volumetric data classification, three different sequence-based models were proposed in the literature [19], [20], [21] using pretrained 2-D-CNNs [10], [15], [22] as backbone networks. The performance of these methods [19], [20], [21] was also evaluated for comparison with our proposed MDF-RN+STB-CN (best model) under the same experimental setup. The method of Srinivasu et al. [20], based on MobileNetV2+LSTM [15], [20], ranked as the second-best, ahead of the other two methods [19], [21]. However, our final MDF-RN+STB-CN model outperformed the method of Srinivasu et al. [20], with average gains of 1.98%, 2.85%, 3.69%, 2.04%, and 1.62% in terms of ACC, F1, PRE, REC, and AUC, respectively. A t-test analysis also highlights the superiority of our final model over [20] at a 99% confidence level, reaching an average p-value of 0.001 (p < 0.01). The proposed pipeline mainly includes a novel 2-D-CNN architecture (named MDF-RN) that leverages multiscale dilated convolution and multilevel feature fusion in a mutually beneficial manner to achieve state-of-the-art performance. Additionally, the second subnetwork (STB-CN) exploits 3-D anatomical dependencies in the case of 3-D imaging data and yields an additional performance gain. Consequently, the proposed model offers better results than various existing methods (see Table V). Fig. 7 further presents the qualitative classification and CBMISR results of our method compared with all the tested baseline methods [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21]. We provide the predicted class label, probability score, and best-matched data sample for each method. We retrieved the best-matched case from the testing database using the predicted class label, as explained in [7].
The key objective of this qualitative analysis is to highlight the predictive confidence score and CBMISR performance of our method in comparison with all the baseline methods. It can be observed (see Fig. 7) that our method attains the highest confidence score, particularly for low-contrast data samples (classes B8 and G1), and that its best-matched cases closely resemble the input query data samples.
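One plausible way to retrieve a best-matched case from the predicted class label is sketched below: the database is first restricted to entries of the predicted class, and the entry whose feature vector is closest to the query is returned. The retrieval procedure itself is defined in [7] and is not reproduced here, so the Euclidean distance, the two-dimensional feature vectors, and the record layout are all assumptions made for illustration.

```python
import math

def retrieve_best_match(query_features, predicted_label, database):
    """Return the database entry of the predicted class closest to the query.

    Closeness is measured with Euclidean distance (an assumed choice).
    Returns None when the database holds no entry of the predicted class.
    """
    candidates = [entry for entry in database if entry["label"] == predicted_label]
    if not candidates:
        return None
    return min(candidates, key=lambda e: math.dist(query_features, e["features"]))

# Hypothetical database records; the identifiers and feature values are made up.
database = [
    {"id": "case_01", "label": "B2", "features": [0.10, 0.90]},
    {"id": "case_02", "label": "B2", "features": [0.80, 0.20]},
    {"id": "case_03", "label": "E10", "features": [0.15, 0.85]},
]

best = retrieve_best_match([0.20, 0.80], "B2", database)
print(best["id"])
```

Because the search is restricted to the predicted class, a wrong class prediction necessarily yields a best-matched case from the wrong class, which is why retrieval quality and classification accuracy are reported together.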

V. DISCUSSION
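In the case of 3-D imaging data, 2-D-CNNs (image-based models) only extract the spatial features from each slice/frame for class prediction. By contrast, 3-D-CNNs extract both spatial and 3-D anatomical features from the entire 3-D sequence and make a class prediction. In the first scenario, 2-D-CNNs neglect 3-D anatomical information for the sequence of 2-D slices, which can result in performance degradation. In the second scenario, 3-D-CNNs include a large number of trainable parameters and require high computation power for training. In addition, 3-D-CNNs are restricted to processing fixed-length volumetric data and may suffer performance degradation on variable-length data in a real-world scenario. To address these issues, a sequence-based classification framework is proposed for the accurate classification of both 2-D/3-D imaging data. Initially, our first proposed image-based model (MDF-RN) extracts a set of n multilevel spatial feature vectors (i.e., f1, f2, f3, ..., fn) from a given sequence of slices/frames (i.e., F1, F2, F3, ..., Fn). Subsequently, the second-stage STB-CN model further exploits 3-D anatomical features from this set of spatial feature vectors and performs the final class prediction. In the case of 3-D imaging data, the use of LSTM models with a 2-D-CNN is more expedient than 3-D-CNNs in terms of computational complexity. In addition, our proposed sequence-based model leverages transfer learning in volumetric data analysis without influencing the overall training parameters. It can also classify variable-length sequences. Our comprehensive ablation study demonstrates the significance of multilevel feature fusion (see Table III) and transfer learning (see Table IV) in developing the proposed sequence-based classification framework for the efficient classification of multimodal 2-D/3-D imaging data. In the context of image classification, our first MDF-RN model outperformed various state-of-the-art 2-D-CNNs (image-based models) (see Table V). In addition to quantitative performance gains, the average inference time of our MDF-RN model was approximately 50% lower than that of the second-best image-based method (Jasil et al. [16]). We also evaluated the performance of various disease- and modality-specific models on our selected datasets and observed that our proposed model significantly (t-test: p < 0.01) outperformed these models in the case of a large cohort (see Table V). For example, our final MDF-RN+STB-CN model outperforms the second-best modality-specific method of Srinivasu et al. [20] with average gains of 1.98%, 2.85%, 3.69%, 2.04%, and 1.62% in terms of ACC, F1, PRE, REC, and AUC, respectively. Most of the existing methods [18], [19], [20], [21] exploit only high-level semantic features to make a classification decision. Our network design mainly leverages multiscale dilated convolutions (DF-blocks) and multilevel feature fusion in a mutually beneficial manner and thereby achieves superior classification results.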
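The reason a recurrent second stage can handle volumes of any length is that it consumes one per-slice feature vector at a time and always ends with a hidden state of fixed size. The sketch below implements a plain single-layer LSTM forward pass in pure Python to show this property. It is a generic LSTM with randomly initialized, untrained weights and hypothetical sizes, and it is not the STB-CN architecture itself.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def init_params(input_size, hidden_size, seed=0):
    """Create small random LSTM weights (untrained, for illustration only)."""
    rng = random.Random(seed)
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
                for _ in range(hidden_size)]
    params = {name: mat() for name in ("Wi", "Wf", "Wo", "Wg")}
    params.update({"bi": [0.0] * hidden_size, "bf": [1.0] * hidden_size,
                   "bo": [0.0] * hidden_size, "bg": [0.0] * hidden_size})
    return params

def lstm_last_hidden(sequence, params, hidden_size):
    """Run the LSTM over a sequence of feature vectors and return the last hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        z = list(x) + h  # concatenate current input with previous hidden state
        i = [sigmoid(a + b) for a, b in zip(matvec(params["Wi"], z), params["bi"])]    # input gate
        f = [sigmoid(a + b) for a, b in zip(matvec(params["Wf"], z), params["bf"])]    # forget gate
        o = [sigmoid(a + b) for a, b in zip(matvec(params["Wo"], z), params["bo"])]    # output gate
        g = [math.tanh(a + b) for a, b in zip(matvec(params["Wg"], z), params["bg"])]  # candidate
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h

INPUT_SIZE, HIDDEN_SIZE = 4, 3          # hypothetical sizes
params = init_params(INPUT_SIZE, HIDDEN_SIZE)

# Two hypothetical volumes with different numbers of slices (3 and 7).
short_seq = [[0.1 * (t + 1)] * INPUT_SIZE for t in range(3)]
long_seq = [[0.1 * (t + 1)] * INPUT_SIZE for t in range(7)]

h_short = lstm_last_hidden(short_seq, params, HIDDEN_SIZE)
h_long = lstm_last_hidden(long_seq, params, HIDDEN_SIZE)
print(len(h_short), len(h_long))
```

Both sequences produce a hidden state of the same length, so a single classification layer can follow regardless of how many slices a volume contains. A 3-D-CNN with a fixed input depth does not have this property.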
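Fig. 8 shows a few examples of correctly and incorrectly classified data samples (including both 2-D/3-D imaging data) using our proposed classification framework. To visualize the discriminative capability of our model, we additionally included the top five predicted probabilities along with each data sample. In most of the correctly classified data samples [see Fig. 8(a)], the best probability score was significantly higher (>85%) than those of the other classes, which highlights the discriminative capability of the proposed method. However, a few incorrect predictions [see Fig. 8(b)] may also occur because of the existence of analogous shapes and texture patterns in different classes. For example, the data samples of class B2 (lung normal) and E10 (lung viral pneumonia) show high interclass similarities in terms of shapes and texture patterns. Therefore, the input sample of class B2 (lung normal) was incorrectly classified as E10 (lung viral pneumonia), with a best probability score of 57.12%. Fig. 8(b) visualizes a few more examples of such incorrectly classified data samples. Apart from high interclass similarities, the poor annotation of data samples can also lead a deep classification model toward false predictions. However, visual assessment can assist medical experts in cross-checking all predicted results. Despite the considerable gains of the proposed classification framework, there are a few limitations that may influence its overall performance in a real-world setting. The main concern is the issue of generalizability, particularly with respect to those classes with a limited number of data samples. Second, real-world data can show high intraclass variance due to various types of imaging modalities, which may influence the prediction results. However, these limitations can be resolved by including a large collection of diversified and well-annotated datasets.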

VI. CONCLUSION
This article aims to develop a deep classification model with the capability to classify multimodal and multiclass medical data, including both 2-D and 3-D imaging data. In particular, a sequence-based deep classification framework (named MDF-RN+STB-CN) is proposed, which mainly leverages transfer learning in the case of volumetric data analysis without influencing the overall training parameters. This is the first study to offer a pretrained classification model in the medical domain based on a large-scale multimodal dataset (including a total of 151,095 data samples related to 70 different classes). Finally, the experimental results exhibited promising performance values of 89.83%, 88.10%, 89.46%, 86.78%, and 93.66% in terms of the average accuracy, F1-score, precision, recall, and area under the curve, respectively, and outperformed various state-of-the-art methods. In a future study, we will explore more heterogeneous datasets and intend to resolve the generalizability issues thoroughly. In addition, the proposed model provides new grounds for future research related to MDS-TL. The strengths of MDS-TL can be further investigated, and additional performance improvements can be achieved in numerous medical diagnostic applications.

Muhammad Owais, Se Woon Cho, and Kang Ryoung Park, Member, IEEE
Index Terms: Computer-aided diagnosis (CAD), medical data analysis, three-dimensional (3-D) deep learning (DL), volumetric model genesis.

Fig. 1. Comprehensive workflow diagram of the proposed classification framework (MDF-RN+STB-CN), including both training and testing phases.

Fig. 2. Overall architecture of the proposed classification framework including both MDF-RN and STB-CN models.

Fig. 3. Visualization of a few data samples for each class in our dataset.
IN OUR SELECTED DATASETS IS PROVIDED IN NOTATION L(X/Y/Z), WHERE L: ACTUAL LABEL, X: IMAGING MODALITY, Y: 2D IMAGING DATA (I) OR 3-D IMAGING DATA (V), AND Z: TOTAL NUMBER OF DATA SAMPLES.("CT: COMPUTED TOMOGRAPHY," "MS: MICROSCOPE," "MRI:

1) In our first comparison (MDF-RN versus RN), we disregarded the 3-D anatomical dependencies of 3-D imaging data by considering the whole data as 2-D imaging data. For the data conversion from 3-D volume to 2-D slices/images, the same class label of each 3-D data sample was considered for its corresponding slices. Consequently, the whole 3-D data samples were converted into 2-D imaging data.
2) In our second comparison (MDF-RN+STB-CN versus MDF-RN), we further highlighted the contribution of 3-D anatomical dependencies of 3-D imaging data by introducing our second-stage STB-CN for the additional feature extraction in case of 3-D volumetric data samples (as explained in Section III-C).
The first proposed MDF-RN model (comprising SR-, PR-, and DF-blocks) outperforms the RN model (comprising only

Fig. 6. Clustering of classification results of the final proposed model in terms of confusion matrix to highlight the performance degradation of each individual class as type I (false-positive) or type II (false-negative) errors.

Fig. 7. Qualitative classification and content-based medical image and sequence retrieval (CBMISR) performance of our proposed and the various state-of-the-art methods (red box: false predictions).
3-D anatomical information for the sequence of 2-D slices, which can result in performance degradation. In the second scenario, 3-D-CNNs include several trainable parameters and require high computation power for training. In addition, 3-D-CNNs are restricted to process a fixed-length volumetric data and may cause performance degradation in case of variable-length data in a real-world scenario. To address these issues, a sequence-based classification framework is proposed for the accurate classification of both 2-D/3-D imaging data. Initially, our first proposed image-based model (MDF-RN) extracts a set of n multilevel spatial feature vectors (i.e., f1, f2, f3, ..., fn) from a given sequence of slices/frames (i.e., F1, F2, F3, ..., Fn). Subsequently, the second-stage STB-CN model further exploits 3-D anatomical features from a set of spatial feature vectors and performs the final class prediction. In the case of 3-D imaging data, the use of LSTM models with 2-D-CNN makes it more expedient than 3-D-CNNs in terms of computational complexity. In addition, our proposed sequence-based model leverages transfer learning in volumetric data analysis without influencing the overall training parameters. It can also classify variable-length sequences. Our comprehensive ablation study proves the significance of multilevel feature fusion (see Table III) and transfer learning (see Table IV) in developing the proposed sequence-based classification framework for the efficient classification of multimodal 2-D/3-D imaging data. In the context of image classification, our first MDF-RN model outperformed various state-of-the-art 2-D-CNNs (image-based models) (see Table

Fig. 8. Illustration of (a) correctly classified and (b) incorrectly classified data samples including top five predicted probabilities.
8(b)] may also turn out because of the existence of analogous shapes and texture patterns in different classes. For example, the data samples of class B2 (lung normal) and E10 (lung viral pneumonia) show high inter-class similarities in terms of shapes and texture patterns. Therefore, the input sample of class B2 (lung normal) was incorrectly classified as E10 (lung viral pneumonia) by achieving the best probability score of 57.12%. Fig. 8(b) visualizes a few more examples of such incorrectly classified data samples. Regardless of high inter-class similarities, the poor annotation of data samples can also lead a deep classification model toward false predictions. However, visual assessment can assist medical experts in performing cross-validation of all predicted results.

TABLE II QUANTITATIVE RESULTS OF OUR PROPOSED MDF-RN+STB-CN (BEST MODEL) ALONG WITH THE PERFORMANCE OF MDF-RN (OUR SECOND-BEST PROPOSED MODEL) AND RN (BASELINE MODEL)

TABLE IV COMPARATIVE RESULTS OF THE PROPOSED MDF-RN AND MDF-RN+STB-CN MODELS WITH AND WITHOUT PERFORMING TRANSFER LEARNING. ("T.L: TRANSFER LEARNING")

TABLE V COMPARATIVE PERFORMANCE ANALYSIS OF OUR PROPOSED MDF-RN+STB-CN (BEST MODEL) AND MDF-RN (SECOND-BEST MODEL) WITH THE VARIOUS STATE-OF-THE-ART METHODS