TD-CNN-LSTM: A Hybrid Approach Combining CNN and LSTM to Classify Brain Tumors on 3D MRI Scans with an Ablation Study

Identification of brain tumors and accurate grading at an early stage are crucial in cancer diagnosis, as a timely diagnosis can increase the chances of survival. Considering the challenges and risks of tumor biopsies, noninvasive imaging procedures such as Magnetic Resonance Imaging (MRI) are extensively used in analyzing brain tumors. Recent advances in medical imaging with deep learning using three-dimensional (3D) MRI are aiding clinical experts significantly in the diagnosis of brain tumors. In this study, three BraTS MRI datasets, BraTS 2018, BraTS 2019 and BraTS 2020, are employed to classify brain tumors into high-grade glioma (HGG) and low-grade glioma (LGG), where each dataset contains four different sequences of 3D MRI brain images for a single patient: T1-weighted MRI (T1), T1-weighted MRI with contrast enhancement (T1ce), T2-weighted MRI (T2), and Fluid Attenuated Inversion Recovery (FLAIR). This research is composed of two approaches. In the first part, we propose a hybrid deep learning model named TimeDistributed-CNN-LSTM (TD-CNN-LSTM), combining a 3D Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), where each layer of the architecture is wrapped with a TimeDistributed function. The objective of developing the hybrid model is to consider all four 3D MRI sequences of each patient as a single input, because every sequence contains necessary information on the tumor that can improve cancer detection performance. However, interpreting all the MRI sequences together with optimal performance, especially in 3D, is quite challenging. Therefore, the model is developed with an optimal configuration based on the highest accuracy obtained from an ablation study of the layer architecture and hyper-parameters. In the second part, a 3D CNN model is trained with each of the MRI sequences respectively to compare its performance with the TD-CNN-LSTM model.
In this regard, for both models, BraTS 2018 and BraTS 2019 are combined to increase the number of images and used as the training dataset, while the BraTS 2020 dataset is employed as the test dataset. Moreover, before training the models, the datasets are preprocessed to ensure the highest performance. The results of these two approaches demonstrate that the TD-CNN-LSTM network outperforms the 3D CNN, achieving the highest test accuracy of 98.90%. Later, to evaluate performance consistency, the TD-CNN-LSTM model is evaluated with K-fold cross-validation using different k values. The approach of combining all the MRI sequences at once using a hybrid CNN-LSTM network with good generalization capability to classify brain tumors can be used effectively in future Computer Aided Diagnosis (CAD) based research, which can aid radiologists in medical diagnostics.

malignant, depending on molecular features, where benign denotes noncancerous cells and malignant denotes cancerous cells. According to the World Health Organization (WHO) [1], brain tumors of grades I and II are considered LGGs, whereas grade III and IV tumors are considered HGGs. The life expectancy of HGG patients is low, approximately 1 to 2 years, whereas patients with LGGs survive on average 5 to 10 years.
LGGs are typically considered slow-spreading infiltrative tumors [2]. Early detection, accurate grading and correct identification of the type of tumor are imperative as a timely diagnosis can aid significantly in decreasing the mortality rate [3].
The diagnosis of tumors is usually done using two common methods: brain biopsies and brain imaging. In a stereotactic biopsy, after drilling a tiny hole into the skull, a small slice of tissue is taken out for observation under a microscope. However, this technique is invasive and risky [4]. Risks of the biopsy test include excessive bleeding and damage to the brain caused by the biopsy needle. This may result in severe migraine headaches, stroke, coma, infection, seizures and might even lead to death [5]. Moreover, biopsies are often inefficient, time-consuming, and life-threatening due to the exceedingly invasive characteristics of the technique [6]. As a solution, imaging techniques, such as single-photon emission computed tomography (SPECT), positron emission tomography (PET), computed tomography (CT), MRI, infrared spectroscopic imaging, or sometimes a combination of these, have been utilized for the diagnosis and classification of brain gliomas [7]. MRI can distinguish soft tissue and identify small changes in tissue density and physiological changes correlated with tumors. Furthermore, the type, dimension, location and malignancy grade of a tumor can be determined with conventional and advanced MRI scans, which are now commonly employed to differentiate LGGs from HGGs [8]. These scans, also known as MRI sequences, capture various properties of tumors depending on different time and intensity settings, where each sequence is regarded as essential in the detection of different tumor subregions [9]. However, the manual categorization of gliomas by doctors is often challenging and error-prone, due to the hidden and complex characteristics of low-grade and high-grade features. The drawbacks of these diagnostic routines are that they are time-consuming and not always accurate [10]. As the number of cancer cases grows over time, it is quite impossible for doctors to diagnose every patient, especially at the primary stage.
Moreover, in some rural areas around the world, timely diagnosis is hindered by the lack of expert radiologists. In this regard, a non-invasive, computer-aided, fully automatic diagnostic system is necessary to aid clinical experts in diagnosis and treatment planning and to reduce the mortality rate by providing more reliable, faster and more accurate tumor detection [11]. An automatic medical image examination method can lessen diagnosis time and error, increase the stability of results and thus reduce the strain on radiologists. CNN-based models have had major accomplishments in medical image research and analysis; however, there remain challenges in accurately classifying grades using 3D images [12]. Moreover, developing a fully automated method with the highest accuracy to categorize brain tumors is still a challenging task, due to the size, texture, localization and intensity similarity of malignant regions to surrounding tissues [13].
In this research, we introduce two deep learning approaches to classify brain tumors into HGG and LGG and compare their performance: in the first approach, a hybrid model combining CNN with LSTM and wrapped by a TimeDistributed function is proposed, and in the second approach a 3D CNN model is employed. Three BraTS 3D MRI datasets are used in this research, each including four 3D MRI sequences per patient: T1, T1ce, T2 and FLAIR. To date, a number of methods have been developed for the detection, segmentation and classification of gliomas using 3D brain MRI. However, in most studies, segmentation or classification is carried out using a single MRI sequence. To the best of our knowledge, no previous research has classified HGG and LGG employing the four 3D sequences of each patient as a single input. However, radiologists examine all the sequences to determine the most accurate tumor grading, as all of the sequences contain relevant features [14]. Following the radiologists' working principle, training a deep learning model with all of these sequences might improve interpretation performance. Therefore, the aim of this paper is to propose a deep learning model with the highest accuracy which is able to interpret all the 3D sequences of a patient as a single input. With this in mind, in the first approach, the model is developed in such a way that all four types of 3D MRI images of a patient can be passed as a single input.
The proposed model is named TimeDistributed-CNN-LSTM (TD-CNN-LSTM), where each layer of the CNN architecture is wrapped with a TimeDistributed function. In this process, LSTM is introduced to learn higher-level parameters while feature extraction is accomplished using the CNN. LSTM can deal with the issue of vanishing gradients efficiently, unlocking certain memory positions in a spatial context. The input layer is configured using a TimeDistributed wrapper to pass the four 3D images of a single patient as one input. In order to obtain the optimal model architecture based on the highest accuracy, an ablation study is performed by changing different hyper-parameters and the layer architecture. In the second approach, a 3D CNN model is trained using each of the four 3D MRI sequences respectively. The purpose of introducing the second approach is mainly to show the effectiveness of our first approach of analyzing all the MRI sequences of a patient at once. To evaluate the performance of both models, the evaluation metrics accuracy, precision, recall, F1-score, specificity, mean absolute error (MAE), root mean squared error (RMSE) and Area Under the Curve (AUC) are analyzed. The results obtained with the TD-CNN-LSTM and 3D CNN models are compared to determine which technique yields the best and most consistent performance. The results suggest that our proposed TD-CNN-LSTM model outperforms the 3D CNN in terms of accuracy, which evidences that, instead of analyzing single sequences, employing all the sequences can significantly improve performance. Moreover, the ablation study proved to be an effective approach for developing a deep learning model with the highest performance. Therefore, in future medical research on 3D MRI imaging, the performance of a model can be improved effectively by employing all the imaging of a patient and carrying out an ablation study.
The rest of the paper is laid out in eight sections. The literature review is conducted in Section II, which also includes the limitations of previous studies. Section III discusses the aim and scope of this study. The datasets used in this research are discussed in Section IV, and the visualization of 3D MRI images is shown in Section V. Preprocessing of these 3D images is described in Section VI. Both of the approaches (hybrid CNN-LSTM and 3D CNN) proposed in this study are briefly explained in Section VII. The computed results of the entire study are showcased in Section VIII, along with a comparison between the proposed hybrid approach and the existing literature. Lastly, Section IX presents the conclusion of this study.

II. LITERATURE REVIEW
In recent years, there have been major advances in the field of brain tumor classification. Deep convolutional neural networks (DCNN) are often used because of their high accuracy, but this comes at the cost of a long computational time for each epoch. Jude et al. [15] employed a DCNN with a modified training algorithm to classify brain tumors into four classes. A total of 220 images comprising T1, T2 and T2-FLAIR sequences were used in this work. Their proposed DCNN achieved an average accuracy of 96.4% in classifying the tumors. Of these images, 80 were in the training set, where the four classes had an equal number of training images (20 each). Although this is a balanced dataset, there is a significant shortage of training images. In a multi-class classification problem, with such a small amount of data the model usually cannot be trained properly, which may increase the false interpretation rate. Moreover, 2D MRI slices of size 256×256 were used in this study, where employing 3D MRI and increasing the number of images might help improve performance. Mzoughi et al. [16] proposed a 3D CNN network to classify brain gliomas into LGG and HGG, resulting in an accuracy of 96.49% on the validation dataset. T1-weighted 3D MRI sequences from the BraTS 3D MRI dataset, containing 209 HGG and 75 LGG MRI scans, were utilized in this work. However, except for accuracy, no other performance evaluation metrics were explored in this study, and employing other MRI sequences might be an effective approach. The DeepSeg model was proposed by the authors of [13] to detect and segment brain tumors using MRI images. In that research, the BraTS 2019 dataset was employed, containing 259 HGG and 76 LGG MRI scans with four sequences, where only the FLAIR MRI sequence of size 224 × 224 was used for segmentation. Mohamed et al. [17] created a tumor grading classifier with a pre-trained VGG16 model and a CNN-based U-net architecture.
The Cancer Imaging Archive (TCIA) dataset, containing LGG MRI data with T1 pre-contrast, FLAIR, and T1 post-contrast sequences from a total of 110 patients, was employed for both the segmentation and classification tasks. After segmenting the images, using a VGG16-based CNN classifier they were able to classify LGG into grade II and grade III with an accuracy of 89%, a sensitivity of 87% and a specificity of 92%. However, for tumor segmentation only the 2D FLAIR sequence was used. For classification, they used 2D MRI data instead of 3D MRI, which provides more details of the brain tumor. Training and validating a CNN model with different sequences may result in inconsistent performance, as these contain different information and features of brain tumors. A similar study regarding this issue was conducted by Banzato et al. [18]. They used a transfer learning model, GoogLeNet, to classify meningiomas and gliomas across all sequences respectively. A total of 80 cases, 56 meningiomas and 24 gliomas, were used in their study. Each case offers brain MRI images of 512 × 512 dimensions, which were scaled down to 224 × 224 pixels. The authors trained the model using post-contrast T1 images to develop their proposed classifier, named trCNN. The highest test accuracy of 94% was recorded with the post-contrast T1 sequence, whereas for the pre-contrast T1 and T2 sequences, accuracies of 91% and 90% were achieved respectively. However, this study employed 2D data and the model was trained with each of the MRI sequences individually. A comparison of classification performance between 2D CNN and 3D CNN models for brain tumor classification was done by the authors of [19]. They used the TCIA dataset comprising 108 MRI images and the BraTS 2018 dataset containing 210 HGG and 75 LGG images, where each dataset contains T1, T1-Gd, T2, and FLAIR sequences. A 3D brain tumor segmentation model, based on the U-net architecture, was proposed.
The authors recorded an accuracy of 97.1% with 3DConvNet (3D CNN), which was higher than the 96.3% accuracy gained with the 2D CNN model. Thiruvenkadam et al. [20] developed six CNN classifiers in order to find the optimal model to classify HGG and LGG lesions in brain MRI images. All their models were constructed by changing the combination of hyper-parameters. For training these models, the BraTS 2013 dataset was used, where 2D slices (240 × 240) were extracted from 3D brain MRI images. Each model was tested with eight different volumes of the Whole Brain Atlas (WBA) dataset, and the average results from all eight volumes were calculated. They achieved the highest accuracy of 88.91% from the five-layer model with stopping criteria and batch normalization (FLSCBN). Though 2D CNNs are widely used in the field of deep learning, they are not flawless in computer vision tasks with 3D volumetric data. To address this issue, some researchers experimented with LSTM models to classify brain tumors in 3D MRI images. Such an approach was proposed by Iram et al. [21], who introduced a combination of CNN and LSTM for the classification of 3D MRI images. They used three transfer learning models, AlexNet, ResNet, and VGGNet, for feature extraction and classified the tumors with an LSTM model utilizing the extracted features. The BraTS 2015 dataset containing four MRI sequences with an image dimension of 240 × 240 × 155 was chosen, where only the FLAIR sequence was used for their experimentation. However, the researchers balanced the dataset by keeping only 60 images in the HGG class and 60 images in the LGG class. Their model performed best with the highest accuracy of 84% using VGG16 as the feature extractor. A similar approach was proposed by Rukmani et al. [12], who used the VGG16 and AlexNet models for feature extraction and an LSTM to classify tumors based on the features. The BraTS 2015 dataset was used, containing 250 HGG and 50 LGG cases with four MRI sequences.
However, only the FLAIR sequence was utilized for the experiment. With their LSTM model, the highest classification accuracy of 85% was recorded with features extracted by the VGG16 network, whereas features extracted by AlexNet resulted in an accuracy of 70%. Other authors [11] developed a CNN architecture of 22 layers for classifying brain tumors into three classes using 3064 2D slices of T1-weighted contrast-enhanced MRI images instead of 3D sequences. Training the model with an image size of 256 × 256 and employing 10-fold cross-validation, they recorded the highest accuracy of 96.56%. Most of the studies described above share a few common limitations: employing only a single sequence, employing 2D slices instead of 3D sequences, and not experimenting with the proposed model using different configurations. The performance of their models could be improved with optimal classification accuracy if these issues were addressed.

III. RESEARCH AIM AND SCOPE
The aim of this research is to mimic a radiologist's analysis pattern in a deep learning model for tumor identification using 3D MRI scans. As Artificial Intelligence is developed following the concept of the human brain, we hypothesize that, if a deep learning model is developed based on the analysis scheme of radiologists, an accurate, reliable and effective performance might be achieved that can outperform other existing approaches. The main contributions of this study can be summarized as follows: 1. In this study, three 3D MRI datasets named BraTS 2018, BraTS 2019 and BraTS 2020 are used to increase the
Our entire interpretability pipeline is represented in Fig. 1.

V. VISUALIZATION OF 3D IMAGES
A 2D image consists of single- or multi-channel pixels, whereas 3D MR images comprise 3D cubes, or voxels. For reading, visualizing and writing 3D neuroimaging data, NiBabel, a Python package, is commonly used. Loading or reading a NIfTI file using the load() function of NiBabel ensures the encoding of all the information in the file, where each detail is known as an 'attribute'. When visualizing a 3D image with NiBabel, a list is initialized, all 155 slices of the 3D volume are iterated over, and whenever a volume is read, each slice is appended sequentially to the list. The number of voxels in a 3D image can be calculated using the following formula:

Vt = St × Hs × Ws

where Vt is the total number of voxels in an image, St is the total number of 2D slices in an image, Hs is the height of each slice and Ws is the width of each slice.
As explained before, the number of slices in each image is 155, and for our study the height and width of each slice are 240 (Fig. 3). The total number of voxels in each image can therefore be calculated as 155 × 240 × 240 = 8,928,000 voxels per image. To get a better understanding of the 3D sequences and their characteristics, a few images are visualized by converting them into 2D slices.
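As an illustrative sketch of the voxel-count formula and the slice-wise iteration described above (using a zero-filled array in place of a real volume, which in practice would come from NiBabel's load() followed by get_fdata()):

```python
import numpy as np

# Simulated 3D MRI volume with the BraTS dimensions described in the text:
# 240 x 240 in-plane resolution, 155 axial slices.
volume = np.zeros((240, 240, 155), dtype=np.float32)

# Vt = St * Hs * Ws : total voxels = slices * slice height * slice width
Hs, Ws, St = volume.shape
Vt = St * Hs * Ws
print(Vt)  # 8928000, matching the figure computed in the text

# Iterating over the volume slice by slice, appending each 2D slice to a
# list, as done when visualizing the volume with NiBabel.
slices = [volume[:, :, i] for i in range(St)]
print(len(slices))  # 155
```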

A. Physics of MRI imaging sequences
MRI provides intense detail of the brain, the spinal cord and the vascular anatomy across three planes: axial, sagittal and coronal. Another advantage of MRI is that it can be used to observe blood flow and vascular malformations in brain tissues. Each of the sequences captures certain, diverse features of a brain tumor with different appearances, which are analyzed to determine the presence and grade of cancer. T1ce images are eminent for visualizing blood-brain barrier (BBB) disruption, while T2 and FLAIR images are recognized for distinguishing tumor boundaries and peritumoral edema [26]. The 3D FLAIR sequence ensures a good signal with fewer voxels, a high spatial resolution with a high signal-to-noise ratio (SNR), and good Cerebrospinal Fluid (CSF) suppression without CSF flow artifacts [27]. The cystic constituents and edema are better portrayed with the T2-weighted than the T1-weighted MRI sequence. In comparison with axial T2-weighted and axial T2-weighted FLAIR images, the axial FLAIR image, with nulling of the signal from the cerebrospinal fluid, shows metastatic lesions more visibly. The FLAIR MRI sequence can aid in tumor detection efficiently by providing precise information on tumor infiltration [13]. However, T1 visualizes an LGG tumor better. Due to hypercellularity, the tumor region appears as a hypointense signal in T1-weighted images and as a hyperintense signal in T2-weighted images [28]. Moreover, T1 is recommended for the segmentation of tumor from non-affected brain cells. T1ce makes the tumor borders visible. With the T2 sequence, the edema (fluid) surrounding the tumor is more visible. FLAIR is suitable for distinguishing the edema area from cerebrospinal fluid [29]. T2 and FLAIR MRI sequences are suitable for evaluating extracellular fluid in the brain parenchyma [30].

1) FLAIR
In the imaging procedure of the FLAIR sequence, the motion of water molecules is suppressed [31]. The FLAIR sequence is produced using very long Time to Echo (TE) and Repetition Time (TR) values, which ensure that abnormalities, for instance edematous tissues, remain bright while the normal CSF appears dark (Fig. 4). Therefore, FLAIR is considered a highly effective MRI sequence for distinguishing the edema region from the CSF. In addition, white matter appears dark grey, the cortex light grey, and fat appears light in a FLAIR MRI sequence.

2) T1
T1-weighted MRI sequences are generated with short TE and TR times, which make the CSF darker (Fig. 5). T1 is a widely used sequence for the analysis of brain tumor patterns, as it allows an easy annotation of the healthy cells. In addition, white matter appears light, the cortex grey, and fat bright in a T1-weighted MRI sequence.

3) T2
T2-weighted sequences are generated with long TE and TR times, making the CSF brighter (Fig. 6). In a T2 sequence, the edema region is brighter than in other MRI sequences. Like FLAIR, white matter appears dark grey, the cortex light grey and fat appears light in a T2-weighted MRI sequence.

4) T1CE
Finally, in T1ce sequences, the brain tumor borders appear brighter because of the accumulation of the contrast agent. This is caused by the disruption of the blood-brain barrier in a proliferative brain tumor area (Fig. 7). By analyzing this sequence, the necrotic core and the active cell regions can be differentiated precisely. Some studies suggest that T1ce is more sensitive than the other sequences [32]. Moreover, T1ce shows details of regional angiogenesis and the integrity of the blood-brain barrier in the tumor region. Table 2 summarizes the brain MRI sequences based on their characteristics and pictorial appearance.

TABLE 2. Summary of brain MRI sequences.
Sequence | TR        | TE        | Tumor  | CSF
FLAIR    | Very long | Very long | Bright | Dark
T1       | Short     | Short     | Dark   | Dark
T2       | Long      | Long      | Bright | Bright
T1ce     | Long      | Long      | Dark   | Dark

3D MRI images can be observed in three planes, namely sagittal, axial, and coronal, which enables medical specialists to examine the shape of the tumors (Fig. 8).

VI. PREPROCESSING PHASE
Classification of brain tumors using 3D MRI scans is always challenging and computationally intensive due to the complex architecture of 3D images. Pre-processing operations are therefore required to enhance the model's performance [4]. In this research, two pre-processing steps, intensity normalization and rescaling, are applied to all four sequences of the dataset.

A. NORMALIZATION AND RESCALING
As intensities differ among MRI images, due to imaging procedures having different TE and TR values, these 3D images need to be normalized. As scanning of patients is likely to be performed in different environments, intensity normalization plays an important role in brain tumor classification.
Data normalization ensures that each input parameter of a model has a similar data distribution, transforming the floating-point feature values from their regular range into a new standard range, usually 0 to 1. The distribution of such pixels resembles a Gaussian curve. Data normalization is done by identifying the approximate upper and lower pixel boundaries of an image so that the data is approximately homogeneously distributed within that particular range. Min-max normalization [33], a widely used technique, is adopted in our study to normalize the pixels. Algorithm 1 describes the min-max normalization process.
The normalization is computed as zi = (xi − mn) / (mx − mn), where x refers to the pixel values (x1, ..., xn), mn and mx are the minimum and maximum pixel values, and zi is the i-th resultant normalized value. The procedure for computing the normalization used in our paper can be explained using the pseudo-code of Algorithm 1:

ALGORITHM 1: Min-max normalization
BEGIN
1. SET mx = MAXIMUM(x) // maximum pixel value of the dataset
2. SET mn = MINIMUM(x) // minimum pixel value of the dataset
3. CALCULATE r = mx - mn // deriving the range of the dataset by deducting the minimum pixel value from the maximum pixel value
4. FOR each pixel xi in x
5.   CALCULATE s = xi - mn // shifting the pixel by the minimum value
6.   DERIVE zi = s / r // zi denotes the normalized value
7. ENDFOR
END
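The min-max normalization described above can be sketched in a few lines of NumPy; the toy pixel array below is purely illustrative:

```python
import numpy as np

def min_max_normalize(x):
    """Min-max normalization: zi = (xi - min(x)) / (max(x) - min(x))."""
    mn, mx = x.min(), x.max()
    r = mx - mn                 # range of the pixel values
    return (x - mn) / r         # each value is shifted, then scaled into [0, 1]

# Toy "slice" of pixel intensities rather than a real MRI volume.
pixels = np.array([0.0, 50.0, 100.0, 200.0])
z = min_max_normalize(pixels)
print(z)  # values scaled into the [0, 1] range
```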
Resizing 3D images accelerates the training process, since most existing systems cannot handle the massive volume of 3D image data. After normalization, the dataset is rescaled to 128×128×32 voxels [34] because of GPU memory limitations. Here, to ensure computational efficiency, only the middle 32 slices are utilized instead of all 155 slices of the brain, and the original size of the 3D MRI image (240×240 pixels) is down-scaled to 128×128 pixels [35].
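The cropping and down-scaling steps can be sketched as follows. This is a minimal illustration assuming SciPy's `ndimage.zoom` with linear interpolation; the paper does not state the exact resizing routine used:

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_volume(vol):
    """Keep the middle 32 of 155 slices, then down-scale 240x240 -> 128x128."""
    depth = vol.shape[2]
    start = depth // 2 - 16                 # first of the middle 32 slices
    cropped = vol[:, :, start:start + 32]
    # Scale the in-plane dimensions down to 128x128, keeping the slice axis.
    factors = (128 / cropped.shape[0], 128 / cropped.shape[1], 1)
    return zoom(cropped, factors, order=1)  # order=1: linear interpolation

vol = np.random.rand(240, 240, 155).astype(np.float32)
out = preprocess_volume(vol)
print(out.shape)  # (128, 128, 32)
```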

VII. PROPOSED APPROACHES
As stated, in this research, two fully automatic deep learning approaches are carried out for the classification of glioma brain tumors into HGG and LGG. In the first approach, an integrated TD-CNN-LSTM network is employed which receives all four sequences as a single input. In Fig. 9, an example of the sequences of a single HGG input, denoted as HGG[0] and consisting of four 3D images, is shown, where all four 3D MRI images for a single subject are passed to the TD-CNN-LSTM network at once as one input. In the second approach, we propose a 3D CNN architecture to classify brain tumors, deriving deep feature interpretation from a single MRI sequence, where the model is trained four times, once for each MRI sequence. For the second approach, the data split ratio is kept similar to the first approach, with the difference that there are now four sets of training, validation and test datasets (one set for each 3D MRI sequence).

B. FIRST APPROACH
Classification using just one MRI sequence could lead to substandard performance, as each of the sequences contains certain characteristics of the brain tumor. Individual MRI sequences comprise independent details, whereas all MRI sequences combined may provide coherent radiomic and relevant clusters of features [36]. As explained in Section V-A, each sequence contains important features, and if all the sequences are fed into a deep learning network, the necessary parameters from all these MRI sequences can be extracted effectively. This technique can hypothetically lead to a better performance. Therefore, we integrate LSTM into the CNN architecture wrapped with the TimeDistributed function, with the aim of extracting the 3D context of slices in a sequential manner. As stated, the optimal model configuration is attained through an ablation study by training the model several times with different configurations. Hence, we initially generate a base model, perform the ablation study on it, and obtain the optimal configuration of our proposed TD-CNN-LSTM model in terms of the highest performance.

1) PROPOSED MODEL
CNNs combined with RNNs have been yielding promising results on several complex computer vision tasks. This approach provides effective solutions by detecting the hidden outlines in visual data using backpropagation [37]. The proposed model is titled TimeDistributed-CNN-LSTM (TD-CNN-LSTM), as the architecture is deployed by combining CNN and LSTM and wrapping them with a TimeDistributed layer [38]. This model consists of four parts: 2D convolutions for feature extraction, pooling layers for feature reduction over the sequences, an LSTM layer, and a final classification layer [39]. In this hybrid network, the CNN handles the spatial dependencies while the LSTM deals with the temporal dependencies. Contrary to the CNN, the LSTM is proficient in sequence processing by adaptively apprehending long-term dependencies and nonlinear dynamics of sequential data. LSTMs are powerful in analyzing a dataset of a sequential nature, but configuring an LSTM with a CNN is complex. To overcome this complexity, a TimeDistributed function is used to configure the input shape and proceed to the convolution and pooling layers.
A TimeDistributed layer typically adds an additional dimension to the input shape of the layer it wraps. This enables the CNN to receive multiple frames as one input. The TimeDistributed layer behaves as a layer wrapper, wrapping the CNN model itself when applied to an input tensor. This wrapper allows the application of a layer to each sequential slice of input data, where the inputs can be 3D. In our experiment, this function is used on each convolution, pooling, and input layer. Here, the input dimension is denoted as input (batch_size, frame, width, height, channel), where batch_size is the pre-defined batch size of the model, frame is the number of frames to input at once, width is the width of the image, height is the height of the image and channel is the number of slices of the image.
In our study, batch_size is 'None', which indicates that the batch size of the model will be defined while training. As we have four 3D images for each subject, the number of frames is equal to four. The last three parameters (height, width, channel) are 128, 128 and 32 respectively. As we have 32 slices for a 3D image, 32 is denoted as the number of channels for each 3D sample. TimeDistributed convolutions are convolutions directed through a TimeDistributed wrapper. This permits the application of any layer to every temporal slice (or frame) of the input individually. In this study, the temporal frames for the MRI data are derived from the 3D volumes [40]. Generally, the TimeDistributed approach is useful when solving a computer vision task that requires a complex model.
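A minimal NumPy sketch (variable names are illustrative) of how the four 3D sequences of one patient can be stacked into the 5D input shape described above; in a Keras-style implementation, this tensor would then be fed through layers wrapped with the TimeDistributed function:

```python
import numpy as np

# Four preprocessed 3D sequences (T1, T1ce, T2, FLAIR) for one patient,
# each of shape (height, width, channel) = (128, 128, 32).
t1, t1ce, t2, flair = (np.zeros((128, 128, 32), dtype=np.float32) for _ in range(4))

# Stack the sequences along a "frame" axis so one patient becomes a single
# sample of shape (frame, height, width, channel) = (4, 128, 128, 32).
patient = np.stack([t1, t1ce, t2, flair], axis=0)

# Adding a batch axis yields the TimeDistributed input shape from the text:
# (batch_size, frame, height, width, channel) = (None, 4, 128, 128, 32).
batch = np.expand_dims(patient, axis=0)
print(batch.shape)  # (1, 4, 128, 128, 32)
```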
LSTM is a variant of RNN introduced to capture global sequence dependencies of the input data. The spatio-temporal parameters processed by LSTM help the model in identifying hidden outlines in challenging frame-to-frame sequences [37]. In this process, the basic features extracted by the CNN are passed to the LSTM layer as inputs to ascertain the temporal dependencies. After that, LSTM layer receives a sequence of CNN outputs as input accruing the temporal dependencies of the frames of all four MRI sequences [41]. An LSTM memory cell comprises of three components the forget gate, the input gate, and the output gate. The LSTM operation procedure consists of the following steps [42]: • The output value of the last instant and the input value of the present instant is passed as input into the forget gate. The output of the forget gate is calculated using (3): where, the range of is (0, 1), represents the weight of the forget gate, # represents the bias of the forget gate, represents the input value of the present time, and finally ℎ denotes the output value of the last instant.
• The output value of the last instant and the input value of the present is passed as input into the input gate. The output value and candidate cell state of the input gate are calculated using (3) and (4): where, the value range of & is (0,1), is the weight of the input gate, # is the bias of the input gate, . is the weight of the candidate input gate, and # . is the bias of the candidate input gate.
• The current cell state is updated using (6): Where the values range of ( is (0, 1).
• The output value $h_{t-1}$ of the last instant and the input value $x_t$ of the present instant are inputted into the output gate at time t. The output $o_t$ of the output gate is acquired as follows using (7):

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (7)

where the value range of $o_t$ is (0, 1), $W_o$ is the weight of the output gate, and $b_o$ is the bias of the output gate.
• The output value of the LSTM is acquired by computing the output of the output gate and the state of the cell, applying the following formula (8):

$h_t = o_t \odot \tanh(C_t)$ (8)

Like RNN, LSTM has time steps and "memory" for each time step. In our study, the time steps represent the MRI image sequences. The diagram in Fig. 10 explains the mechanism of the LSTM cell at time step t.

FIGURE 10. Operation procedure of LSTM
The forget gate (f), input gate (i) and output gate (o) employ the sigmoid activation function, and the candidate layer uses tanh as its activation function. Data can be added, removed or updated in the cell state through the sigmoid gates [43]. The method of recognizing and eliminating data is determined by the sigmoid function, which takes the output of the last LSTM unit ($h_{t-1}$) at time t − 1 and the present input ($x_t$) at time t. The sigmoid function decides which parts of the previous output should be eliminated by the forget gate $f_t$. Each gate computes the weighted current input (through U) and the weighted previous hidden state (through W), adds them together with a bias, and applies its activation function, producing the vectors f, $\tilde{C}$, i and o in a range of 0 to 1 for sigmoid and −1 to 1 for tanh at every time step. $C_{t-1}$ and $C_t$ denote the cell states at times t − 1 and t, respectively, and b represents the bias. The memory state C of the LSTM cell is where the memory or context of the input is stored. This stored information can be modified at different time steps. The output values ($h_t$) are determined based on the cell state and the output gate ($o_t$). A sigmoid layer determines which portion of the cell state will be used for the output. Next, the output of the sigmoid gate ($o_t$) is multiplied with the values produced by the tanh layer from the cell state ($C_t$).
In our study, the information for all MRI sequences is stored in memory state C. The final output is based on analyzing all features of the four MRI sequences (Fig. 11).
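The gate equations (3)–(8) above can be traced with a minimal NumPy implementation of a single LSTM cell. The weight shapes and the random initialization below are illustrative assumptions, not the configuration used in the proposed model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b each hold the parameters of the four gates: f, i, c (candidate), o.
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate, Eq. (3)
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate, Eq. (4)
    c_hat = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate state, Eq. (5)
    c = f * c_prev + i * c_hat                                # cell state update, Eq. (6)
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate, Eq. (7)
    h = o * np.tanh(c)                                        # hidden output, Eq. (8)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 8, 4  # illustrative sizes
W = {g: rng.normal(size=(n_hid, n_in)) for g in 'fico'}
U = {g: rng.normal(size=(n_hid, n_hid)) for g in 'fico'}
b = {g: np.zeros(n_hid) for g in 'fico'}

# Four time steps, one per MRI-sequence feature vector.
h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(4):
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
print(h.shape)  # (4,)
```

Because $o_t \in (0,1)$ and $\tanh(C_t) \in (-1,1)$, every hidden output component stays strictly inside (−1, 1).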

2) BASE MODEL
We initialize our experiment from a base CNN model containing two convolutional layers, each followed by a maxpool layer, where each layer is wrapped with a TimeDistributed function. To begin with, the network consists of 5 × 5 sized convolutional kernels, the number of kernels in each convolutional layer is set to 32, 'ReLU' is chosen as the activation function, 'softmax' is utilized as the activation function for the final layer, 'categorical_crossentropy' is chosen as the loss function and, lastly, the Adam optimizer with a learning rate of 0.001 is chosen. The number of epochs for training is set to 100.
Batch size is set to 16 while training the model. The input shape of the 3D images is 4 × 128 × 128 × 32, where 4 denotes the number of sequences in a 3D data point, 128 × 128 denotes the height × width of each slice, and 32 refers to the depth (number of 2D slices in a 3D image) of each 3D image.
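Assuming Keras' default 'valid' padding with stride 1 (an assumption, as the text does not state the padding), the per-frame spatial size produced by the base model's two conv(5 × 5) + maxpool(2 × 2) blocks can be traced as follows:

```python
def conv_out(size, kernel):   # 'valid' convolution, stride 1
    return size - kernel + 1

def pool_out(size, pool=2):   # non-overlapping max pooling
    return size // pool

# Base model: two conv(5x5, 32 filters) + maxpool(2x2) blocks,
# applied per frame to 128x128 inputs with 32 channels.
h = w = 128
for _ in range(2):
    h, w = conv_out(h, 5), conv_out(w, 5)
    h, w = pool_out(h), pool_out(w)
print(h, w)  # 29 29
```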

3) ABLATION STUDY
The idea of 'ablation' is based on animal experiments where nerves in particular regions of the brain were eradicated in a selective manner, with the purpose of investigating the behavioural effects of this destruction and thereby obtaining a better understanding of the function of these regions. The term "ablation study" is nowadays mostly used in the context of neural networks [44] to observe a model's performance by studying the effect of altering some of its components [45]. Therefore, the effects of ablation are investigated on our base model with ten study cases by changing different parameters, the number of layers and the filters to evaluate the impact of these components on the proposed architecture [46]. In this way, with the alteration of different components in every study case, the optimal component is chosen based on the highest accuracy and carried forward to the next study case. Moreover, the issue of time complexity is taken into account while not compromising accuracy. Thus, after completing all the study cases, the best-performing configuration of our proposed TD-CNN-LSTM model is achieved, with the highest accuracy and lowest time complexity. The results are described in Section VIII-B.

4) TD-CNN-LSTM ARCHITECTURE
After performing an ablation study of ten cases on the base model, the optimal model architecture is generated, for which the highest performance is recorded with the lowest possible computational complexity. In the architecture illustrated in Fig. 12, the input shape is a five-dimensional data vector (None, 4, 128, 128, 32), where 4 denotes the four MRI sequences to be inputted simultaneously and 128 × 128 × 32 is the input dimension of every MRI sequence.

FIGURE 12. TD-CNN-LSTM model architecture
The network comprises four blocks containing a total of 15 layers: four convolutional layers, four maxpool layers, four batch normalization layers, a flatten layer, and an LSTM layer followed by a dense layer. Each of the layers is wrapped with a TimeDistributed function that allows the layer to be applied to all temporal slices of the 3D MRI sequences. Each block consists of a 2 × 2 kernel sized convolutional layer followed by a 2 × 2 maxpool layer. First, the input layer is fed to Block-1, containing the first convolutional layer of 64 filters. This extracts features from the 3D input image and generates a feature map. The final dense layer uses a sigmoid activation function to classify between HGG and LGG. This function takes any real-valued input (x) and generates outputs with a value between zero and one. In this process, a 3D image is considered as a series of 2D slices that can be represented as X = {x1, x2, · · · , xn}, where n denotes the number of slices. From these slices, the convolutional layers extract the required features, and the maxpool layers shrink the individual slices. Finally, the sequence is condensed to P = {p1, p2, · · · , pm}, where p is the feature vector of the 2D slices. The proposed model is trained with binary cross-entropy as the loss function, using the following equation [5]:

$L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$

where n denotes the number of samples, $y_i$ represents the true label of a particular sample and $\hat{y}_i$ denotes its predicted label.
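Under the same 'valid'-padding assumption as before, the per-frame feature size entering the LSTM can be traced through the four conv(2 × 2) + maxpool(2 × 2) blocks; the filter count of 128 in the final block follows the ablation study's chosen configuration (64, 64, 64, 128), but the padding is an assumption:

```python
def conv_out(size, kernel):   # 'valid' convolution, stride 1
    return size - kernel + 1

def pool_out(size, pool=2):   # non-overlapping max pooling
    return size // pool

h = w = 128
for _ in range(4):            # four conv(2x2) + maxpool(2x2) blocks
    h, w = conv_out(h, 2), conv_out(w, 2)
    h, w = pool_out(h), pool_out(w)

filters_last = 128            # filter count of the final block (from the ablation study)
flat = h * w * filters_last   # flattened features per frame fed to the LSTM
print(h, w, flat)             # 7 7 6272
```

The LSTM then consumes a sequence of four such flattened vectors, one per MRI sequence.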

C. SECOND APPROACH
As explained earlier, to evaluate the effectiveness of our proposed TD-CNN-LSTM model, we have trained a 3D CNN model having 3D convolutional and 3D maxpool layers and compared the performance. The model is trained four times, once on each of the four training sets of 3D images, leading to separate results for the FLAIR, T1, T2 and T1ce MRI sequences.
The 3D CNN provides a comprehensive feature map, analysing the volumetric spatial information and integrating nonlinear 3D contextual information [16]. A 3D network detects 2D structures such as edges and corners in a 3D manner [47]. The proposed CNN architecture is constructed of four layer types: convolutional, sub-sampling, batch normalization (BN) and FC layers, equipped with a sigmoid activation function for HGG and LGG classification. The input layer of the proposed 3D model receives a multichannel brain MRI which can be denoted as [5]:

$I \in \mathbb{R}^{w \times h \times m}$

where m denotes the number of channels and w and h represent the width and height of the channels respectively. Compared to a 2D model, a 3D CNN is computationally intensive and requires more memory. Each convolutional layer of our proposed network is structured with a 3 × 3 kernel, which enables the model to perform faster, as smaller convolutional kernels yield improved efficiency due to the smaller number of trainable parameters. The function of a 3D convolutional layer can be stated as [48]:

$c_{jk}^{l}(x, y, z) = \sum_{r_x=0}^{R_x-1} \sum_{r_y=0}^{R_y-1} \sum_{r_z=0}^{R_z-1} w_{jk}^{l}(r_x, r_y, r_z)\, v_k^{l-1}(x + r_x,\, y + r_y,\, z + r_z)$

here, the voxel positions for any given 3D image are denoted as x, y and z, respectively; the weight of the j-th 3D kernel, connecting the k-th feature map of layer l−1 and the j-th feature map in layer l, is represented by $w_{jk}^{l}(r_x, r_y, r_z)$; the k-th feature map in layer l−1 is $v_k^{l-1}$; and the kernel sizes corresponding to x, y and z are $R_x$, $R_y$ and $R_z$ respectively. The kernel filter's convolutional response is $c_{jk}^{l}(x, y, z)$.
The rectified linear unit (ReLU), a commonly used activation function for deep CNNs due to its computational efficiency and reduced likelihood of vanishing gradients, is employed in each convolutional layer. The function can be defined as [48]:

$f(x) = \max(0, x)$

In a 3D max-pooling procedure, the maximum value in each cubic region is passed as input to the subsequent feature volume. The equation can be stated as [49]:

$h_{xyz}^{l} = \max_{(p, q, r) \in \Omega_{xyz}} h_{pqr}^{l-1}$ (13)

where $h^{l}$ is the activation of layer l and $\Omega_{xyz}$ is the cubic pooling region associated with output position (x, y, z).
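Equation (13) can be illustrated with a minimal NumPy 3D max-pooling routine over non-overlapping cubic regions; the 2 × 2 × 2 window and stride, and the toy input volume, are illustrative assumptions:

```python
import numpy as np

def relu(x):
    # ReLU activation: f(x) = max(0, x)
    return np.maximum(0.0, x)

def maxpool3d(volume, p=2):
    # Non-overlapping 3D max pooling: the maximum of each p x p x p
    # cubic region is passed to the next feature volume, Eq. (13).
    d, h, w = volume.shape
    v = volume[:d - d % p, :h - h % p, :w - w % p]
    return v.reshape(d // p, p, h // p, p, w // p, p).max(axis=(1, 3, 5))

vol = np.arange(64, dtype=float).reshape(4, 4, 4)
out = maxpool3d(vol)
print(out.shape)     # (2, 2, 2)
print(out[0, 0, 0])  # 21.0 -- the maximum of the first 2x2x2 cube
```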
A BN layer is utilized to reduce internal covariate shift. The FC layer classifies the data, performing nonlinear operations on the generated parameters to obtain the output [49]:

$h^{l} = f(W^{l} h^{l-1} + b^{l})$ (14)

where $h^{l}$ is the activation of layer l and $W^{l}$ and $b^{l}$ are the learnable weights and biases of the layer.

1) 3D CNN ARCHITECTURE
The 3D CNN architecture comprises five blocks and a depth size of 16. Each block has a 3D convolutional layer with a kernel size of 2 × 2, followed by a 2 × 2 maxpool layer and a BN layer. The input layer takes 3D images of size 128 × 128 × 32 × 1 as input. The 3D CNN architecture is illustrated in Fig. 13.
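Assuming the convolutions preserve spatial size (e.g. 'same' padding, which is an assumption, as the padding is not stated) and the pooling halves every dimension, the volume shrinks over the five blocks as follows:

```python
def halve(n, times):
    # Repeated 2x downsampling by non-overlapping max pooling.
    for _ in range(times):
        n //= 2
    return n

# Five conv + maxpool + BN blocks; under the 'same'-padding assumption,
# only the pooling changes the (height, width, depth) of the volume.
h, w, d = 128, 128, 32
h, w, d = halve(h, 5), halve(w, 5), halve(d, 5)
print(h, w, d)  # 4 4 1
```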

A. EVALUATION METRICS
Often, when doing binary classification, a single metric (the accuracy score) is used to evaluate the performance of the classifier. This approach is flawed, however, as the classifier might yield poor results upon evaluation with other metrics. To properly evaluate the effectiveness of a binary classifier, several evaluation metrics are required. Keeping this in mind, both the TD-CNN-LSTM and 3D CNN models are evaluated using different classification metrics: Precision (Pre), Recall (Rec), Specificity (Spe), F1-score (F1) and Accuracy (ACC). Two model error metrics, MAE and RMSE, are also calculated. The AUC value is calculated by generating Receiver Operating Characteristic (ROC) curves for all models [44].
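These classification metrics all derive from the confusion-matrix counts. A minimal sketch, using a hypothetical count split (the 200/65 class split is an assumption for illustration, not the paper's reported breakdown):

```python
def metrics(tp, tn, fp, fn):
    # Standard binary-classification metrics from confusion-matrix counts.
    pre = tp / (tp + fp)                   # precision
    rec = tp / (tp + fn)                   # recall (sensitivity)
    spe = tn / (tn + fp)                   # specificity
    f1 = 2 * pre * rec / (pre + rec)       # F1-score
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    return pre, rec, spe, f1, acc

# Hypothetical counts: all positives correct, one negative misclassified.
pre, rec, spe, f1, acc = metrics(tp=200, tn=64, fp=1, fn=0)
print(round(acc, 4))  # 0.9962
```

A high accuracy alone would hide, for example, a poor specificity, which is why all five metrics are reported together.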

B. RESULT OF ABLATION STUDY
The ablation study concludes with various changes of elements in the base CNN model, and the results are recorded. Time complexity [50] and test accuracy are evaluated for each experimental configuration. Time complexity can theoretically be defined as [51]:

$O\!\left( \sum_{j=1}^{k} v_{j-1} \cdot F_j \cdot G_j \cdot v_j \cdot U_{F_j} \cdot U_{G_j} \right)$

here, j indicates the index of a convolutional layer and k refers to the total number of convolutional layers; $v_{j-1}$ indicates the number of kernels or input channels in the (j−1)-th convolutional layer and $v_j$ refers to the number of kernels in the j-th layer; $F_j$ and $G_j$ indicate the width and height of the kernels in the j-th layer; and $U_{F_j}$ and $U_{G_j}$ indicate the width and height of the generated feature map, correspondingly.
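For illustration, the summation can be evaluated for a hypothetical four-block configuration; the channel counts and 'valid'-padded feature-map sizes below are assumptions made for this sketch, not the paper's exact configuration:

```python
def conv_time_complexity(layers):
    # layers: list of (in_channels, kernel_w, kernel_h, n_kernels, out_w, out_h)
    # Sums v_{j-1} * F_j * G_j * v_j * U_Fj * U_Gj over all conv layers.
    total = 0
    for v_prev, f, g, v, u_f, u_g in layers:
        total += v_prev * f * g * v * u_f * u_g
    return total

# Hypothetical four-block configuration (2x2 kernels, 'valid' padding):
layers = [
    (32, 2, 2, 64, 127, 127),
    (64, 2, 2, 64, 62, 62),
    (64, 2, 2, 64, 30, 30),
    (64, 2, 2, 128, 14, 14),
]
print(conv_time_complexity(layers) / 1e6, 'million multiply-accumulates')
```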
Results of the ablation studies conducted with the (external) testing dataset are presented in Table 3 and Table 4, where Table 3 contains all the results relating to the model's layer configurations and activation functions and Table 4 showcases the results of tuning the hyper-parameters, loss function and flatten layer.
Case study 1: changing the number of layers In this case study, the initial configuration of the base model mentioned above is kept as it is while changing the number of convolutional and maxpool layers. Table 3 shows the performance for different configurations of the altered model architecture. The best performance is achieved by configurations 3 and 4 (Table 3) with a test accuracy of 95.06%. Regarding time complexity, however, configuration 3, containing four pairs of convolutional and maxpool layers, shows a lower time complexity (116.32 million) than configuration 4. Taking this into consideration, configuration 3 was selected for the rest of the ablation case studies.
Case study 2: changing filter size In this case study, experiments with various kernel sizes of 3 × 3, 2 × 2 and 5 × 5 are carried out to observe the performance [52]. It is evident that changing the filter size does not affect the overall performance much (Table 3). However, the highest test accuracy, 95.91%, is acquired by employing the 2 × 2 kernel size. Thus, configuration 2 is chosen for further ablation case studies.
Case study 3: changing the number of filters Initially, we started with a constant number of kernels [53] for all four convolutional layers (32, 32, 32 and 32). Later, the number of filters is increased to 64 and a significant improvement in performance is found. However, we anticipated that gradually increasing the number of filters might be a better approach. This is represented in configurations 3, 4 and 5 (Table 3). It is observed that configuration 5, with filter numbers 64, 64, 64 and 128 for the four convolutional layers, achieves the highest performance with a test accuracy of 97.75%. Therefore, we move forward with configuration 5.
Case study 4: changing the type of pooling layer Two pooling layers, maxpool and average pool, are evaluated [50], where both pooling layers gained the same highest accuracy of 97.75% (Table 3). In this regard, the maxpool layer is chosen for further ablation studies.
Case study 5: changing the activation function As various activation functions can have an impact on the performance of a model, choosing an optimal activation function is relevant to building an optimal model. Six activation functions, Linear, PReLU, ReLU, Leaky ReLU, Tanh and Exponential Linear Units (ELU) [54], are experimented with. It is found that the linear activation function performs best, with a test accuracy of 98.33% (Table 3). This activation function is chosen for further ablation studies. Among the remaining case studies on the hyper-parameters, loss function and flatten layer, the best performance (Table 4) was recorded with a learning rate of 0.0001 and the Nadam optimizer. A visual representation of the gradual performance boost across the ablation study cases is shown in Fig. 14 for better understanding.
After completing the ablation studies on the base model, the proposed TD-CNN-LSTM model is acquired, in which a significant improvement in classification accuracy is observed. The configuration of the TD-CNN-LSTM model is presented in Table 5.

C. PERFORMANCE COMPARISON WITH 3D CNN MODEL
A comparison of the two approaches regarding various accuracy and error metrics is conducted in this section. In this regard, the training configuration of the 3D CNN model is kept similar to the TD-CNN-LSTM model, where the Nadam optimizer is used with a learning rate of 0.0001, a batch size of 16 and 150 epochs. The classification results acquired from the TD-CNN-LSTM and 3D CNN models, which are trained on their respective datasets and tested on the respective external testing dataset, are shown in Table 6. The results of our proposed model are shown in bold in the tables. The RMSE for all the datasets is shown in Fig. 16. Like MAE, the lower the value, the better the classifier performs. It can be observed from Fig. 16 that the TD-CNN-LSTM model with the combined dataset had the lowest RMSE value (13.86). The 3D CNN model performed moderately with the FLAIR, T1 and T1ce datasets.

D. OPTIMAL MODEL EVALUATION
In order to further evaluate the effectiveness of the TD-CNN-LSTM model on the external dataset, a confusion matrix and ROC curve are generated. Moreover, experiments with various K-fold configurations are conducted to observe the robustness of the proposed model.

FIGURE 17. Confusion matrix of combined dataset trained on TD-CNN-LSTM model
A confusion matrix for the results of TD-CNN-LSTM model tested on the external dataset is presented in Fig. 17.
The row values represent the actual labels of the test dataset and the column values represent the predicted labels. In this case there is a binary classification with classes HGG and LGG. The diagonal values from top left to bottom right are the TP and TN values. It can be seen that all HGG cases were predicted correctly and that there was just one misclassification for LGG. The TD-CNN-LSTM classifier trained on the combined dataset does not seem to be biased towards any one class, which validates the robustness of the model. The ROC probability curve is plotted, and the AUC value is derived (Fig. 18). The AUC is used as a summary of the ROC curve, representing the performance of a model in distinguishing between classes. An AUC value of 1 indicates that the model is able to detect all the TP and TN values flawlessly. Fig. 18 shows that the ROC curve nearly reaches the peak of the y-axis, with a false positive rate close to 0 and a true positive rate close to 1. The AUC value is 99.04%.
In order to further justify the consistency of the model's performance, we test the model using a total of twelve K-fold cross validation configurations with K values ranging from 3 to 30. The findings of each K-fold cross validation are presented in Fig. 19. It is evident that the model yields good performance (>98%) for all the K-folds. The performance did not drop significantly for any fold, which further adds to the robustness and consistency of our model. It also indicates the strength of the proposed approach (combining all four MRI sequences and using a hybrid CNN-LSTM model) in 3D MRI classification problems, where the model is able to generate such consistent performance.
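The K-fold protocol above can be sketched with a simple index splitter; this is a minimal sketch without shuffling or stratification, and the fold count of 5 and the sample count of 265 (the external test set size) are used only for illustration:

```python
def kfold_indices(n_samples, k):
    # Split sample indices into k contiguous, near-equal folds;
    # each fold serves once as the held-out set.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(265, 5)
print([len(f) for f in folds])  # [53, 53, 53, 53, 53]
```

Every sample appears in exactly one fold, so each K configuration evaluates the model on the full dataset once.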
The best-performing classifier, TD-CNN-LSTM trained on the combined dataset, produced promising results across all evaluation measures on the external test set. The TimeDistributed 2D convolutional layers in the TD-CNN-LSTM model are capable of analyzing the combined 3D dataset. As discussed earlier, this combined dataset feeds a set of 3D images to the classifier simultaneously, taking less than 3 seconds per epoch in training. The model is able to achieve its highest test accuracy in under 80 epochs, whereas a traditional 3D CNN model takes much longer to train (150 epochs) to achieve moderate accuracy. Since the four sequences in 3D MRI images contain features of different aspects of brain tumors, combining all these sequences into a single input makes the tumor region more defined, which makes it easier to identify tumors. The entire approach of combining all the MRI sequences while training gives the model more definition of the tumor region, which consequently boosts the classification performance while maintaining a low training time. Furthermore, the fact that the model is able to give 98.90% correct predictions on an unseen external test dataset adds to the effectiveness of this approach on real-world datasets. Moreover, the fact that TD-CNN-LSTM along with the combined dataset had similar performance across multiple K-fold configurations is an indication of the robustness of the proposed approach. In this section, the proposed TD-CNN-LSTM model is compared with some recent studies on brain tumour classification (Section III). Table 9 shows a comparison between these previous studies and our proposed approach based on accuracy. As stated before, the proposed TD-CNN-LSTM model, trained on the combined dataset, has the best results in our study, achieving an accuracy of 98.90%. More than half of the studies shown in Table 9 used similar 3D brain MRI datasets and two studies used classifiers that are somewhat similar to our TD-CNN-LSTM model.
Most results in Table 9 are in the range of 85%–97%, which is lower than ours. Hiba et al. [16] used the BraTS 2018 dataset, achieving an accuracy of 96.49% with deep CNNs; however, only the T1-Gado MRI sequence was utilized. Unlike the previous work, Thiruvenkada et al.'s study [20] used 2D slices of the T2 sequence and managed to achieve an accuracy of 88.91%. Jude et al. [15] proposed a modified deep CNN classifier which can classify 2D MRI slices and demonstrated an accuracy of 96.4%. Mohammad et al. [17] used 2D slices from T1 pre-contrast, FLAIR, and T1 post-contrast sequences and, using a modified VGG 16, obtained an accuracy of 89%. A similar study was done by Tommaso et al. [18], where, with a GoogLeNet model and 2D slices of pre- and post-contrast T1 and T2 sequences, they were able to achieve an accuracy of 94%. A fairly recent study conducted by Ying Zhuge [19] used a 3D CNN model and gained an accuracy of 97.1% using T1, T1-Gd, T2, and FLAIR sequences. Linmin et al. [26] proposed a 3D CNN model that can train on 3D MRI images and achieved an accuracy of 58.6%, which is the lowest of all the studies shown in Table 9. The studies of Iram et al. [21] and Rukmani et al. [12] are more comparable to this paper. These studies used BraTS 2015 and BraTS 2018, respectively, and used LSTM models, which is somewhat similar to our study. However, they used 2D slices for HGG and LGG classification. Using BraTS with LSTM, Iram et al. [21] achieved 84% accuracy, whereas Rukmani et al. [12] achieved an accuracy score of 85% using the AlexNet-LSTM model.

E. COMPARISON WITH SOME EXISTING LITERATURE
Our proposed TD-CNN-LSTM model outperforms all these studies with a test accuracy of 98.90%. The idea of using the four 3D images of a particular subject as a single input to a deep learning model contributed to the effectiveness of our model. With this process, more details of the brain tumour can be learned by the model than with the other approaches described in the literature. Training on this combined dataset makes the training process less time consuming, and the highly detailed 3D images make it easier for the classifier to distinguish between HGG and LGG tumors. Moreover, in this research, rigorous K-fold cross validation experiments are performed, which confirm the consistent performance of the model in all 12 K-fold configurations. A large external test dataset (265 samples) ensures that the model is tested properly for every ablation study case while building the model, which adds to a sound evaluation of the robustness of the proposed approach. Furthermore, the optimal accuracy achieved on an external test dataset gives a glimpse of the model's interpretation of 3D MRIs in real-world applications. With this proposed approach, TD-CNN-LSTM trained on the combined dataset can outperform all other studies while also requiring fewer epochs for training. This demonstrates the potential of the proposed approach in classifying brain tumors from 3D MRI scans.

IX. CONCLUSION
The aim of this study is to translate the medical experts' diagnosis process of analysing all the MRI sequences of a single patient to determine cancer into a deep learning approach with the highest accuracy and lowest computational complexity. A completely automated hybrid deep learning model named TD-CNN-LSTM, combining CNN with LSTM for classifying brain tumors into HGG and LGG using 3D volumetric MRI, is proposed in this paper; the input layer is wrapped with a TimeDistributed function so that all four MRI sequences of a patient can be passed as a single input. In addition, a 3D CNN approach is employed to train with the MRI sequences independently to compare the performance of our proposed model. The results show that the TD-CNN-LSTM network with the optimal configuration outperforms the 3D CNN model, achieving the highest accuracy of 98.90% with the Nadam optimizer and a learning rate of 0.0001. Moreover, the results of the K-fold cross validation method validate the robustness and consistency of our model's performance over different training scenarios. Our method of analysing all the MRI sequences together and developing a CNN model employing an ablation study, in order to achieve the highest accuracy while keeping the lowest time complexity, can be a useful approach for future research and real-world tumor diagnosis systems.