LCDEiT: A Linear Complexity Data-Efficient Image Transformer for MRI Brain Tumor Classification

Current deep learning (DL)-assisted brain tumor classification models suffer from inductive bias and parameter dependency when extracting texture-based image information. To address these problems, vision transformer models have recently replaced conventional DL models for classification tasks. However, the high performance of the vision transformer depends on a large-scale dataset and on self-attention computed between all pairs of image patches, which results in quadratic computational complexity. To address these problems, the vision transformer must be data-efficient, so that it can be trained well with a limited amount of data, and its computational complexity must be linear in the number of image patches. Consequently, this paper presents a novel linear-complexity data-efficient image transformer called LCDEiT, which trains on small datasets by using a teacher-student strategy and achieves linear computational complexity in the number of patches by using an external attention mechanism. The teacher model comprises a custom gated-pooled convolutional neural network that provides knowledge to the transformer-based student model for the classification of MRI brain tumors. The average classification accuracy and F1-score are 98.11% and 97.86% on the Figshare dataset and 93.69% and 93.68% on the BraTS-21 dataset, respectively. The results indicate that the proposed model could have a great impact on medical imaging-based diagnosis, where data availability and fast computation are the main concerns.


I. INTRODUCTION
The mortality rate due to brain cancer can be reduced by detecting the specific class of brain tumor at an early stage. Several imaging techniques, such as computed tomography (CT), positron emission tomography (PET), and magnetic resonance imaging (MRI), contribute to the early detection of brain cancer. Among these, MRI is the most widely used imaging technique in the medical field [1], [2]. The manual identification of brain tumors from these imaging techniques can lead to a false diagnosis, which may pose a threat to life [3]. Therefore, an automated system that quickly identifies the correct brain tumor class can play a significant supporting role alongside traditional imaging techniques [4]. Currently, automated systems are developed using machine learning (ML) [5], [6], [7], [8], [9] and deep learning (DL) [10], [11], [12], [13], [14] algorithms to identify brain tumor classes precisely. However, ML techniques are limited in processing image datasets because handcrafted feature extraction is needed before data processing [15], [16]. In contrast, DL techniques have the advantage of extracting meaningful features automatically before classification [17].
The associate editor coordinating the review of this manuscript and approving it for publication was Kathiravan Srinivasan.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In earlier studies, convolutional architectures dominated the DL field for computer vision tasks such as classification, segmentation, object detection, and recognition. Pre-trained convolutional neural networks (CNNs) such as the residual neural network (ResNet) have outperformed other convolutional architectures [18] on the classification task. CNN models are in most cases architecture-specific because they depend on the parameters and training procedures. Moreover, a CNN focuses on texture information and generates its output under the assumptions of locality and weight sharing, which leads to inductive bias. Therefore, the vision transformer (ViT) model was developed as a replacement for the CNN model owing to its better computational efficiency and scalability. The ViT model incorporates a self-attention-based core transformer model to find the relations between non-overlapping patches of the image. Multiple self-attention operations executed in parallel, called multi-head self-attention, each attend to a particular image feature, and a fully connected dense layer then classifies the image into its actual class [19]. However, the performance of the ViT model is limited by two concerns. The first is the requirement of a large dataset for optimal model accuracy. The second is the quadratic computational complexity with respect to image size caused by the self-attention mechanism. Of these two concerns, the advanced ViT model called the Swin transformer focuses on the complexity concern and achieves complexity that is linear in the image size. It computes local self-attention within non-overlapping windows and shifts the windows between consecutive layers [20]. However, the requirement of sufficient data for high performance of the Swin transformer remains a problem to be addressed.
In contrast, the data-efficient image transformer (DEiT) addresses only the data requirement: it can be trained on small datasets with the help of a teacher model. A distillation token helps the student model to absorb the knowledge of the teacher model through attention [21]. However, its multi-head self-attention computes the relations between all patches, which leads to quadratic complexity in the number of patches. For this reason, self-attention can be replaced by an external attention mechanism based on learnable memory units to reduce the quadratic computational complexity of classification tasks [22]. Therefore, to overcome the two limitations of the ViT model, namely the enormous dataset requirement and quadratic computation, a model needs to be data-efficient and its computation needs to be linear in the image size without compromising accuracy. The primary contributions of this work are outlined as follows:
• A linear complexity data-efficient image transformer (LCDEiT) is developed to classify brain tumors, which could have a great impact on future medical imaging.
• A custom gated-pooled CNN is employed as a teacher model that distills knowledge to a transformer-based student model, providing data efficiency by reducing the requirement of a large dataset and contributing to the calculation of the cross-entropy loss.
• A multi-head external attention mechanism is introduced to provide computation that is linear in the number of patches, which ultimately reduces the model's training parameters and time without compromising classification accuracy.
The remainder of the paper is organized as follows: Section II presents an overview of related research on the classification task. A detailed explanation of the proposed methodology is presented in Section III. The datasets and evaluation metrics are described in Section IV. The results are described in detail in Section V with performance measurements. A comparative analysis with state-of-the-art models is presented in the discussion in Section VI. Finally, the conclusion and future directions are given in Section VII.

II. RELATED WORK
Earlier studies used ML algorithms as base models for computer vision tasks. For instance, M. A. et al. [8] used a gray-level co-occurrence matrix (GLCM) for statistical feature extraction and the discrete wavelet transform (DWT) for brain tumor segmentation, which improved performance and reduced complexity. The noise arising from segmentation was removed by morphological operations, and classification was then performed by a support vector machine (SVM) classifier. Moreover, Prabhpreet et al. [9] proposed an MRI brain tumor detection technique comprising tumor segmentation and statistical feature extraction followed by binary classification into benign and malignant. A modified median filtering and multi-vector segmentation method supported the SVM classifier for tumor classification. For ML-based classifiers, the generation of statistical features from raw images is handcrafted and user-specific, which degrades model performance. Later studies therefore treated the CNN as the standard framework for computer vision tasks because of its ability to extract important features automatically from raw images. Ghosal et al. [18] employed a CNN-based squeeze-and-excitation ResNet model for the image classification task. In addition, zero centering and intensity normalization provided smooth variation in intensity, which increased the effectiveness of the classification task. However, CNN-based architectures are restricted to local features, and model performance is affected by the inductive bias problem. Therefore, transformer-based architectures were developed to extract global information from the input images. In one study, Dosovitskiy et al. [19] proposed the ViT model for several computer vision tasks, including classification, segmentation, detection, and recognition.
Each task began by splitting the image into patches and feeding the projected patches into the transformer encoder. The output of the encoder was then used for the task-specific prediction. The requirement of a vast training dataset is one of the main bottlenecks of this work. Another limitation is the quadratic complexity with respect to image size caused by the multi-head self-attention mechanism. To address the complexity concern alone, Liu et al. [20] introduced the Swin transformer, in which self-attention is computed within non-overlapping windows, resulting in complexity linear in the image size. However, the requirement of an enormous dataset is still a problem for superior performance of the Swin transformer. To address the large-dataset problem alone, Touvron et al. [21] proposed a model called DEiT that relies on a distillation token so that the model can be trained well with limited data. A pre-trained RegNetY-16GF teacher model distilled knowledge to the transformer-based student model through the distillation token. The student model, however, employed a multi-head self-attention mechanism for final classification, which makes the computational complexity quadratic. In another study, Tolstikhin et al. [23] developed the MLP-Mixer architecture for computer vision tasks, which eliminated both convolution and self-attention. The work was mainly based on multi-layer perceptrons for token mixing and channel mixing separately, with a linear layer used for the final classification. However, the lower accuracy compared with ViT and the requirement of large data remain issues for this work. Moreover, Wang et al. [24] proposed a pyramid ViT that employed a linear-complexity attention layer based on spatial reduction attention (SRA) and tokenized images with overlapping patch embedding to capture the local continuity of information.
The use of SRA makes the computational complexity linear; however, a vast amount of data is still necessary to obtain optimum results. For this reason, Lee et al. [25] used shifted patch tokenization (SPT) to embed more spatial information in each visual token, where spatially shifted images were concatenated with the input image. The local self-attention (LSA) mechanism sharpens the distribution of attention scores to reduce the smoothing problem, but the complexity is still quadratic in the image size. Moreover, Trockman et al. [26] designed the ConvMixer model, which uses convolution to mix the spatial and channel dimensions: a depth-wise convolution mixes spatial locations, and a pointwise convolution then mixes channel locations to increase data efficiency. This design, however, compromises accuracy on small datasets. Similarly, the Shift-ViT model introduced by Wang et al. [27] replaced attention with a zero-parameter shift operation, with classification performed by a linear layer. Eliminating the attention operation removes the complexity concern, but a vast amount of data is still required for superior model performance. On the other hand, Zhang et al. [28] used a transformer for COVID-19 diagnosis from chest CT images. After segmentation of the lung images with UNet, a Swin transformer was used for feature extraction. However, this model also suffers from the enormous data requirement of the Swin transformer. To address this limitation, Zhiqin et al. [29] employed shifted patch tokenization on a Swin transformer for the specific task of brain tumor segmentation by fusing deep semantics and edge information from multimodal MRI. Although feature extraction with the Swin transformer has linear complexity, the edge feature extraction depends on a CNN, which may lead to an inductive bias problem.
In another work, the spatial-channel feature preserving vision transformer (SCViT) proposed by Pengyuan et al. [30] extracted long-range dependencies between features and accounted for the contribution of the different channels to classification by computing lightweight channel attention. This version of ViT suffers from both limitations, namely quadratic complexity and a vast data requirement. Similarly, Bazi et al. [31] used a ViT for remote sensing image classification with several data augmentation techniques, such as CutMix, Cutout, and Mixup, to obtain sufficient training data. Half of the layers were pruned from the model to reduce parameters and complexity without compromising accuracy. Moreover, Wang et al. [32] proposed the vision transformer-plus (ViT-P) architecture, which addressed class imbalance by applying deep convolutional generative adversarial networks (DCGAN). Channel attention then related the different channels and obtained the important features of each channel for the classification task. The performance of the architectures in [31] and [32] is limited by the two core limitations of the ViT model.
In summary, existing transformer-based classification models suffer from two problems: self-attention leads to computational complexity that is quadratic in the number of patches, and an enormous dataset is required for superior classification results. Therefore, using an external attention-based transformer as the student model and a customized gated-pooled CNN as the teacher model can overcome these deficiencies of state-of-the-art classification models.

III. PROPOSED METHODOLOGY
Figure 1 illustrates the proposed LCDEiT framework for MRI brain tumor classification. A teacher-student strategy allows the student model to learn through external attention and to receive distilled knowledge from the teacher model. A customized gated-pooled CNN is used as the teacher model, which gives the student model the data efficiency needed for classification on small datasets. The customized gated-pooled CNN is designed to generalize based on the data fed into it. The teacher model contributes to calculating and minimizing the total cross-entropy loss in the overall LCDEiT architecture. Moreover, an external attention-based transformer is employed as the student model, which calculates the attention between patches with linear complexity for final classification. The overall classification procedure is explained in the following subsections.

A. IMAGE PROCESSING
Initially, the raw images of size (512 × 512) are resized to (32 × 32) before patching. Because the transformer processes a sequence of image patch tokens, the fixed-size input image is first converted into non-overlapping patches of fixed size. The raw image, $I$, with dimension $H \times W = 32 \times 32$ and a patch resolution of $P \times P = 2 \times 2$ yields a total of $N = (H \times W)/P^2 = 256$ patches. Equation (1) presents the formulation of the patch matrix, $I_{Patching}$, from the raw image. The patches in $I_{Patching}$ are then projected to feature vectors by a linear layer with a fixed dimension, $D = 64$, which results in the patch tokens, $I_p$. After that, a position embedding, $E_{pos}$, is added to each patch token, $I_p$, to retain position information, which gives the projected output, $I_{Projection}$.
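As an illustration of the patching and projection steps above, the following minimal NumPy sketch splits a (32 × 32) image into (2 × 2) non-overlapping patches and projects them to D = 64. The random image, the projection weights `W_proj`, and the position embedding `E_pos` are placeholders assumed for illustration (in the model they are the input data and learned parameters), and a single-channel image is assumed.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split a 2-D image into non-overlapping, flattened patches."""
    height, width = image.shape
    assert height % patch_size == 0 and width % patch_size == 0
    rows, cols = height // patch_size, width // patch_size
    # (rows, P, cols, P) -> (rows, cols, P, P) -> (N, P*P)
    patches = image.reshape(rows, patch_size, cols, patch_size)
    patches = patches.transpose(0, 2, 1, 3)
    return patches.reshape(rows * cols, patch_size * patch_size)

rng = np.random.default_rng(0)
image = rng.random((32, 32))                    # placeholder for a resized MRI slice
patches = image_to_patches(image, 2)            # I_Patching, shape (256, 4)

D = 64
W_proj = rng.normal(scale=0.02, size=(4, D))    # stand-in for the learned linear layer
E_pos = rng.normal(scale=0.02, size=(256, D))   # stand-in for the learned position embedding
I_projection = patches @ W_proj + E_pos         # I_Projection, shape (256, 64)
print(patches.shape, I_projection.shape)
```

The reshape-and-transpose route avoids explicit loops; any equivalent patch extraction would serve the same purpose.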

B. CLASS TOKEN
A trainable classification token, $T_c$, is prepended to $I_{Projection}$ before it is fed into the transformer encoder. The output of this token is passed to the classification head for tumor class prediction. The transformer encoder relates the patch tokens, $I_p$, and the classification token, $T_c$, through external attention with dimension $D = 64$, but only the classification token is responsible for predicting the final output. Moreover, the classification token is used to calculate the student loss in the training stage and to predict the final class in the testing stage.

C. DISTILLATION TOKEN
Another token, called the distillation token, $T_d$, is added alongside $I_{Projection}$ and $T_c$ and interacts with them through external attention in the transformer encoder. The prediction of the distillation token contributes to the distillation loss in the training stage, and the average of the predictions from the classification and distillation tokens is used for final classification in the testing stage.
The input of the transformer encoder, $T_{in}$, has dimension $(N + 2) \times D = 258 \times 64$, obtained by concatenating the two tokens, $T_d$ and $T_c$, with $I_{Projection}$. The classification and distillation tokens are initialized to zero with dimension $D = 64$ and are updated during training.
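The assembly of the encoder input can be sketched as follows. The placeholder values for the projected patches and the position of the distillation token in the sequence (appended last here) are assumptions made for illustration; the text above only states that the classification token is prepended and that both tokens are zero-initialized.

```python
import numpy as np

N, D = 256, 64
I_projection = np.ones((N, D))   # placeholder for the projected patch tokens
T_c = np.zeros((1, D))           # classification token, zero-initialized, trainable in the model
T_d = np.zeros((1, D))           # distillation token, zero-initialized, trainable in the model

# Assumed ordering: [T_c, patch tokens, T_d]
T_in = np.concatenate([T_c, I_projection, T_d], axis=0)
print(T_in.shape)
```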

D. TRANSFORMER ENCODER
The transformer encoder, $T_E$, consists of multi-head external attention (EA) followed by a multi-layer perceptron (MLP). The details of EA are described in the next subsection. The MLP block consists of two fully connected linear layers with the Gaussian error linear unit (GELU) activation function. The number of nodes in each of the two fully connected layers equals the projected feature dimension, $D = 64$. Moreover, a skip connection is maintained around both the EA and MLP blocks to ensure feature reusability and to solve the degradation problem. The normalization layer in both the EA and MLP blocks normalizes the summed input to reduce dependencies between instances. In this work, a stack of eight identical transformer encoders is used, each with 4-head EA and an MLP block of [64, 64] units. TABLE 1 specifies all the required parameters of the transformer-based student model in the proposed LCDEiT architecture, and TABLE 2 gives the shape and number of parameters of the blocks in the student model. The image is split into patches of dimension (2 × 2), which results in $N = 256$ patches per image. The patches are then projected to a fixed feature dimension, $D = 64$, so the shape of the projection block is (256, 64). The input to $T_E$ comprises the patch tokens and the two extra tokens (classification and distillation), which gives a shape of (258, 64). Both the shape and the parameters of $T_E$ remain the same throughout the eight stacked transformer encoders, which keeps the parameter count of the overall transformer-based student model under control.

E. EXTERNAL ATTENTION
The core function of the transformer encoder, $T_E$, is based on the EA mechanism, which gives the proposed method its linear complexity. Figure 2 visualizes the complexity assessment of self-attention (SA) and external attention (EA). In the traditional SA mechanism, query (Q), key (K), and value (V) vectors are generated from each patch. The matrix multiplication of Q and the transpose of K gives $QK^T$. The normalized result is then matrix multiplied with V. Calculating attention in this way requires N operations for a single patch, where N is the total number of patches in an image. Therefore, calculating attention for the whole image requires $N^2$ operations. This is depicted in FIGURE 2(a), where an image consists of $N = 256$ patches and the attention calculation for each patch, $SA_{Patch\_1}$, needs $N = 256$ operations, which leads to the quadratic computational complexity $O(N^2)$ for whole-image attention.
In contrast, the EA mechanism computes the pixel-wise relation between the image patches and memory units. Two learnable parameters that are independent of the input features are introduced externally as a key memory unit, $M_k$, and a value memory unit, $M_V$. Only the query vectors are generated from the normalized patch tokens, $T_{in-norm}$, which reduces the number of input-dependent variables. This increases the robustness of the model compared with self-attention, where all three vectors (Q, K, V) are generated from the patch. The EA mechanism is illustrated in FIGURE 2(b), where an image has $N = 256$ patches and the attention calculation for a single patch, $EA_{Patch\_1}$, needs only one operation, which gives linear computational complexity $O(N)$ for whole-image attention, in contrast to SA. A generalized version of the whole EA process is given in (5).
FIGURE 2. Visualization of complexity assessment of self-attention (SA) and external attention (EA) mechanism.
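One simple way to see the difference in growth is to count the entries of the attention map: SA scores every patch against every other patch (N × N entries), whereas EA scores every patch against a fixed number of memory slots (N × S entries). The memory size `S = 64` below is a hypothetical value chosen only for illustration.

```python
def attention_map_entries(num_patches, memory_size=None):
    """Count attention scores: N*N for self-attention, N*S for external attention."""
    if memory_size is None:
        return num_patches * num_patches   # every patch attends to every patch
    return num_patches * memory_size       # every patch attends to S memory slots

S = 64  # hypothetical memory size
for n in (256, 1024, 4096):
    print(n, attention_map_entries(n), attention_map_entries(n, S))
```

Quadrupling the number of patches multiplies the SA count by sixteen but the EA count only by four.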
The correlation between the input patches and the shared memory is computed using only two linear layers and two normalization layers. The product of the query vector and the learnable key memory unit, $M_k$, is double normalized. The resulting attention map, $A_m$, is then matrix multiplied with the learnable value memory unit, $M_V$, which generates the external attention output, $T_{out}$ (see FIGURE 1(b)). In the transformer encoder, $T_{in}$ is normalized and the resulting normalized patch tokens, $T_{in-norm}$, are fed into the multi-head external attention block. For both key and value memory units, the dimension is [...]
The double normalization technique applies normalization twice, normalizing the columns and rows separately, to handle sensitivity to the scale of the input features. First, single normalization is applied to the product of the patch queries and the key memory unit: the exponential of each element is divided by the row-wise summation of the exponentials, where $R = 258$ because the patch tokens and the two additional tokens make a total of 258 rows. The output of single normalization, $S_{norm}$, is then normalized again by dividing by the column-wise summation of the single-normalized values, which gives the double-normalized output, $D_{norm}$, where $C = 64$ because the feature dimension is 64. With $a_{i,j}$ denoting the element in row $i$ and column $j$ of the product of the queries and the key memory unit, double normalization is expressed as follows:
$$S_{norm}(i,j) = \frac{\exp(a_{i,j})}{\sum_{r=1}^{R} \exp(a_{r,j})}$$
$$D_{norm}(i,j) = \frac{S_{norm}(i,j)}{\sum_{c=1}^{C} S_{norm}(i,c)}$$
The multi-head EA is performed by repeating the EA computation multiple times in parallel; each repetition is referred to as an attention head, h. This boosts performance by extending the learning capability of the model to capture different aspects of the relation between patches. The query vector from each patch is transformed independently into H linearly projected query vectors using dense layers, where H is the total number of heads.
These projected query vectors and the external key and value memory units are used to calculate the attention output H times in parallel. The concatenation of the H head outputs, $(h_1, \ldots, h_H)$, is then transformed by another linear projection matrix, $W_o$, which gives the multi-head external attention output, $T_{multi\_head}$. This transformation matrix keeps the dimensions of the input and output consistent.
$$T_{multi\_head} = \mathrm{Concat}(h_1, \ldots, h_H)\, W_o \qquad (10)$$
In this work, 4-head external attention is computed to extract the relations between patches. Four query vectors are extracted from each normalized patch token, and after double normalization, four attention maps are obtained.
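The external attention computation described in this subsection can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the memory size `S = 64`, the sharing of `M_k` and `M_v` across heads, the single query projection `W_q` split into heads, and all random weights are hypothetical choices.

```python
import numpy as np

def double_normalize(scores):
    """Softmax over the token axis (rows), then L1 normalization over the memory axis (columns)."""
    scores = scores - scores.max(axis=0, keepdims=True)   # numerical stability only
    s_norm = np.exp(scores)
    s_norm = s_norm / s_norm.sum(axis=0, keepdims=True)   # sum over the R = 258 rows
    return s_norm / s_norm.sum(axis=1, keepdims=True)     # sum over the columns

def multi_head_external_attention(t_in_norm, W_q, M_k, M_v, W_o, num_heads):
    """t_in_norm: (N+2, D); M_k, M_v: (S, D / num_heads), assumed shared by all heads."""
    n_tokens, dim = t_in_norm.shape
    head_dim = dim // num_heads
    queries = (t_in_norm @ W_q).reshape(n_tokens, num_heads, head_dim)
    queries = queries.transpose(1, 0, 2)                  # (H, N+2, head_dim)
    heads = []
    for q in queries:
        A_m = double_normalize(q @ M_k.T)                 # attention map, (N+2, S)
        heads.append(A_m @ M_v)                           # head output, (N+2, head_dim)
    return np.concatenate(heads, axis=1) @ W_o            # (N+2, D)

rng = np.random.default_rng(0)
D, H, S = 64, 4, 64                                       # S is a hypothetical memory size
T_in_norm = rng.normal(size=(258, D))                     # placeholder normalized tokens
W_q = rng.normal(scale=0.1, size=(D, D))
W_o = rng.normal(scale=0.1, size=(D, D))
M_k = rng.normal(scale=0.1, size=(S, D // H))
M_v = rng.normal(scale=0.1, size=(S, D // H))
T_multi_head = multi_head_external_attention(T_in_norm, W_q, M_k, M_v, W_o, H)
print(T_multi_head.shape)
```

Because the memory units do not depend on the input, the cost of each head grows with the number of tokens times S rather than with the square of the number of tokens.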

F. TEACHER MODEL
The teacher-student strategy is based on a knowledge distillation framework. Knowledge distillation is a model compression technique in which a heavyweight, complex teacher model transfers knowledge to a lightweight student model. A strong image classifier such as a convolutional neural network or a transformer can be used as the teacher model. Earlier studies indicate that a ConvNet teacher performs better than a transformer-based teacher. In this work, the core idea of knowledge distillation is used in a slightly different way, because one of our concerns is to reduce complexity. Instead of a heavyweight model, a customized lightweight gated-pooled CNN, which can learn complex patterns [40] from the data fed into it, is used in place of the RegNetY-16GF used as the traditional teacher model. The lower complexity of the teacher model means fewer parameters, which leads to less computational effort. To make the model more responsive to the characteristics of the features extracted by the Conv layer, a gated max-average pooling layer is employed. The gated pooling function improves invariance properties compared with traditional pooling, which reduces the inductive bias problem [40]. Initially, the raw input images are fed into a convolutional (Conv) layer with a rectified linear unit (ReLU) activation function. Then, a gated pooling operation is performed: the dot product of a gating mask, M, and the Conv features, x, is fed into the sigmoid function, σ, to obtain the mixing proportion of max pooling, $P_{max}$, and average pooling, $P_{avg}$, that produces the gated pooling output, $P_{gated}$.
Therefore, the mixing proportion varies depending on the characteristics of the region being pooled. In the teacher model, the Conv and gated pooling operations are first performed twice, followed by two consecutive Conv layers with the ReLU activation function. Using these two consecutive Conv layers without a pooling layer reduces the number of parameters in the CNN. Then, instead of a flatten layer, global average pooling is used, which reduces each feature map to a single number by averaging all its pixel values, whereas a flatten layer only reshapes the 2D feature maps into a 1D vector. This also reduces the number of parameters and hence the overfitting problem. Two dense layers then follow, and a final dense layer with a softmax activation function produces the teacher's prediction. The layer properties and parameters of the teacher network, which has 11 layers altogether, are summarized in TABLE 3. The model is trained on the same Figshare and BraTS-21 datasets and achieves accuracies of 88.00% and 92.85%, respectively. The adequate performance of this teacher model helps the transformer-based student model to learn, through the distillation token, local detail information that it is normally unable to capture, and to minimize the total cross-entropy loss, which ultimately reduces misclassification.
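The gated pooling operation described above can be sketched for a single pooling region as follows. The sketch assumes the usual gated max-average form, $P_{gated} = \sigma(M^T x) P_{max} + (1 - \sigma(M^T x)) P_{avg}$, which matches the description in the text; the function name and the example values are hypothetical.

```python
import math

def gated_pool(region, gating_mask):
    """Mix max and average pooling of one region using a sigmoid gate on mask . region."""
    logit = sum(m * x for m, x in zip(gating_mask, region))   # dot product of M and x
    gate = 1.0 / (1.0 + math.exp(-logit))                     # mixing proportion in (0, 1)
    p_max = max(region)
    p_avg = sum(region) / len(region)
    return gate * p_max + (1.0 - gate) * p_avg

region = [1.0, 2.0, 3.0, 4.0]                     # flattened 2 x 2 pooling region (example values)
print(gated_pool(region, [0.0, 0.0, 0.0, 0.0]))   # zero mask: equal mix of max and average
print(gated_pool(region, [1.0, 1.0, 1.0, 1.0]))   # positive mask: leans toward max pooling
```

Since the gate depends on the region's own values, each region receives its own mix of max and average pooling, which is the responsiveness the text refers to.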

G. STUDENT AND DISTILLATION LOSS
Student loss refers to the cross-entropy loss between the true labels from the original dataset and the prediction of the classification token. The teacher model contributes to the cross-entropy distillation loss between the prediction of the teacher model and that of the distillation token. The total loss is the average of the student loss and the distillation loss. Therefore, the total loss, $T_L$, is calculated using equation (14), where $P_T$, $P_c$, $P_d$, and $y$ denote the predictions from the teacher model, the classification token, and the distillation token and the true label from the original dataset, respectively, and $L_{CE}$ is the cross-entropy loss function. The cross-entropy loss is given in equations (13) and (14), where M is the total number of classes in a dataset: $M = 3$ for the Figshare dataset and $M = 4$ for the BraTS-21 dataset.
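The averaging of student and distillation losses can be sketched as below. Whether the teacher's prediction enters the distillation loss as a hard label (its arg-max class) or as a soft probability vector is not stated in the text, so the `hard_teacher` switch is an assumption, as are the function names and example probabilities.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Categorical cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def total_loss(y_true, p_c, p_d, p_t, hard_teacher=True):
    """Average of student loss CE(y, P_c) and distillation loss CE(teacher target, P_d)."""
    if hard_teacher:
        best = max(range(len(p_t)), key=p_t.__getitem__)
        teacher_target = [1.0 if i == best else 0.0 for i in range(len(p_t))]
    else:
        teacher_target = p_t
    student = cross_entropy(y_true, p_c)
    distillation = cross_entropy(teacher_target, p_d)
    return 0.5 * (student + distillation)

y = [0.0, 1.0, 0.0]          # one-hot true label, M = 3 classes
P_c = [0.1, 0.8, 0.1]        # classification-token prediction (example values)
P_d = [0.2, 0.7, 0.1]        # distillation-token prediction (example values)
P_T = [0.1, 0.6, 0.3]        # teacher prediction (example values)
print(total_loss(y, P_c, P_d, P_T))
```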

H. CLASSIFICATION
To determine the class of a test image, the predictions from the classification token and the distillation token are fed into the final classification layer. Equation (15) gives the mean prediction from both tokens, which yields the predicted class of the test image.
Here, $P_F$ is the final prediction and $P_c$ and $P_d$ are the predictions from the classification and distillation tokens. The test image fed into the model is split into patches and concatenated with both tokens. The predictions from the classification and distillation heads are then averaged to obtain the final prediction.
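The test-time averaging of the two head predictions can be sketched as follows; the function name and the example probabilities are hypothetical.

```python
def final_prediction(p_c, p_d):
    """Average the classification-head and distillation-head probabilities and pick the top class."""
    p_f = [(c + d) / 2.0 for c, d in zip(p_c, p_d)]
    predicted_class = max(range(len(p_f)), key=p_f.__getitem__)
    return p_f, predicted_class

P_F, class_index = final_prediction([0.2, 0.5, 0.3], [0.1, 0.3, 0.6])
print(P_F, class_index)
```

In this example the two heads disagree on the top class, and the averaged prediction decides between them.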

I. MODEL HYPERPARAMETERS
Hyperparameter selection is an important factor in training the proposed LCDEiT model for superior results. The model is compiled using the AdamW optimizer with a learning rate of 0.00025 and fitted with a batch size of 512. The model is trained for 30 epochs per fold. Categorical cross-entropy is chosen as the loss function to compute the student and distillation losses.

IV. EXPERIMENTS
The proposed model is trained and tested on the Google Colab platform with Python version 3.7.13. The model is implemented using Keras 2.8.0 with the TensorFlow 2.8.2 framework. The NumPy 1.21.6 and Scikit-learn 1.0.2 packages are used for image data preparation and evaluation, respectively. During training, the model occupied 3.80 GB of RAM and 38.79 GB of disk space in the Colab environment. In this experiment, the dataset is randomly divided into ten approximately equal portions; each portion in turn is used as the test set and the rest are used as the training set. For every fold, the model is fitted on the corresponding training set and evaluated on the held-out test set. The network is trained with data shuffling in every iteration. Finally, the model evaluation metrics are estimated by averaging the ten results. The steps of ten-fold cross-validation are summarized as follows:
• The dataset is divided into ten portions, each containing an approximately equal number of images.
• For each fold, one portion is selected as the test set and the remaining portions are used as the training set. The selected portion changes in every fold.
• The average of the ten results is taken to obtain the final result.
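The steps above can be sketched in plain Python as follows. The helper names, the fixed seed, and the `train_and_evaluate` callback (standing in for fitting the model on one fold and returning a score) are hypothetical.

```python
import random

def ten_fold_indices(num_samples, num_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each fold of a shuffled split."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::num_folds] for i in range(num_folds)]   # approximately equal portions
    for k in range(num_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

def cross_validate(num_samples, train_and_evaluate, num_folds=10):
    """Average the score returned by train_and_evaluate(train_idx, test_idx) over all folds."""
    scores = [train_and_evaluate(train_idx, test_idx)
              for train_idx, test_idx in ten_fold_indices(num_samples, num_folds)]
    return sum(scores) / len(scores)

# Dummy callback that only reports the test-set fraction, to show the call pattern.
print(cross_validate(100, lambda train_idx, test_idx: len(test_idx) / 100))
```

In practice the callback would train a fresh model on the training indices and return the chosen metric on the test indices.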

A. BENCHMARK DATASET
Two benchmark datasets are employed in this work. The first was created by Cheng [33] and acquired from Nanfang Hospital and General Hospital, Tianjin Medical University, China. It contains T1-weighted contrast-enhanced MRI images of 233 brain tumor patients with three tumor types: pituitary, meningioma, and glioma. The second dataset, developed by Baid et al. [34], contains multi-parametric magnetic resonance imaging (mpMRI) scans of 2,040 brain tumor patients in four different classes: fluid-attenuated inversion recovery (FLAIR), native T1-weighted (T1w), T1-weighted post-contrast (T1wce), and T2-weighted (T2w).

B. EVALUATION METRICS
The most widely used performance indices such as accuracy, precision, recall, and F1-score are considered in this work for evaluating model performance in performing the classification task.
Here, $T_P$ and $T_N$ refer to true positives and true negatives, and $F_P$ and $F_N$ represent false positives and false negatives, respectively. Accuracy treats every class in the dataset equally by taking into account the overall true and false, positive and negative values, which is effective for a balanced dataset. Because real-life datasets may not always be class-balanced, precision, recall, and F1-score are also widely used as performance measures. Precision and recall both focus on class-wise performance. The F1-score, in contrast, is the harmonic mean of precision and recall, so the proposed model is assessed primarily in terms of the F1-score.
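For reference, the four metrics follow the standard definitions in terms of the confusion counts of one class, as sketched below. The function name and the example counts are hypothetical and do not correspond to any result in this paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from the confusion counts of one class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0   # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

print(classification_metrics(tp=90, tn=95, fp=5, fn=10))
```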

V. RESULTS AND PERFORMANCE ANALYSIS
This section presents the results of extensive experiments on the two benchmark datasets. To ensure a sound classification measurement, the training and testing datasets are kept in different folders.

A. QUANTITATIVE ANALYSIS
For the quantitative analysis of the proposed network, the class-wise precision, recall, and F1-score are evaluated.

The attention mechanisms are analyzed by taking two raw image samples from the two datasets and applying both self-attention (SA) and external attention (EA) to them, as shown in TABLE 9. SA removes all the relevant pixels in the target brain region, whereas EA keeps all the relevant pixels in that area. Therefore, after multi-head attention is performed, this single attention map output focuses specifically on tumor shape-based feature extraction. In addition, TABLE 10 indicates that the proposed gated-pooled CNN achieves an accuracy 4.51% higher than the traditional RegNetY-16GF and the conventional max-pooled CNN. The experiments show the advantage of being able to distill knowledge from this responsive model without compromising model accuracy. This customized gated-pooled CNN model is robust and generalizes well enough to remove the need for a more complex teacher model.
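To make the structural difference between SA and EA concrete, the sketch below contrasts a single-head version of each in NumPy. It follows the commonly described form of external attention, in which the patch features attend to two small learnable memories and the attention map is normalized twice. It is not the authors' implementation: the memory size `S`, the weight matrices, and the function names are assumptions for illustration, and the multi-head arrangement used in LCDEiT is omitted.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention on x of shape (N, d); the map is N x N."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)              # softmax over the keys
    return attn @ v, attn


def external_attention(x, m_k, m_v):
    """Single-head external attention; memories m_k, m_v have shape (S, d).

    The attention map is N x S, so its size grows linearly with N.
    """
    scores = x @ m_k.T                                     # (N, S)
    scores = scores - scores.max(axis=0, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=0, keepdims=True)                # softmax over the N patches
    attn /= attn.sum(axis=1, keepdims=True) + 1e-9         # L1-normalize over the S slots
    return attn @ m_v, attn                                # output shape (N, d)
```

With `N` patches, the self-attention map holds `N * N` entries, whereas the external-attention map holds `N * S` entries for a fixed memory size `S`, which is where the linear dependence on the number of patches comes from.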

VI. DISCUSSION
Because GPU memory is constrained, the images were rescaled and the parameters of the proposed LCDEiT model were tuned accordingly. To assess the practical feasibility of the proposed framework in fields such as medical imaging, its behavior under varying image size and patch size is analyzed in TABLE 11. A constant patch size of (16×16) is used for running the model with image sizes up to (256×256). Increasing the image size changes the accuracy by (2-3)% when the model parameters are not re-tuned. If parameter tuning is applied, the accuracy is expected to be similar to that obtained with the (32×32) image size. Given these small changes in the results, the developed LCDEiT model can be considered generalizable and practically feasible where larger-resolution images are preferred.

For further assessment of the proposed LCDEiT model, a statistical analysis based on the Wilcoxon test [41] is performed to determine the p-value.

In addition, the performance of different classifiers is compared in terms of testing accuracy in TABLE 13. This table implies that the proposed model, i.e. DEiT with a gated-pooled CNN teacher model and external attention, increases classification accuracy and thereby reduces misclassification. Moreover, the transformer gives much better classification accuracy than the ResNet-50 transfer learning model. However, the high performance of the vision transformer depends on large datasets. DEiT addresses this data dependency, at the cost of slightly lower accuracy. DEiT with a RegNetY-16GF teacher model gives 1.7% and 1.5% lower accuracy than the vision transformer for the Figshare and BraTS-21 datasets respectively. The customized DEiT with a generalized and robust teacher model improves accuracy by 2.29% over the conventional DEiT for both datasets. The customized DEiT with multi-head external attention provides improved accuracy of 98.11% and 93.69% for the Figshare and BraTS-21 datasets respectively.
Furthermore, the comparison of the proposed LCDEiT model with state-of-the-art models is summarized in TABLE 14 and TABLE 15. TABLE 14 compares the existing models that use the Figshare data, and TABLE 15 compares the existing models that use the BraTS-21 data. The comparison shows that the proposed LCDEiT model improves accuracy over the existing techniques for both datasets. In TABLE 14, a customized CNN is employed in [35] for the classification of the Figshare data and reaches an accuracy of 95.40%. The performance is degraded by 1.2% in [10], where a genetic algorithm is employed to choose the proper parameters for the network. A pre-trained ResNet-50 model with global average pooling is utilized in [36], with an accuracy of 97.48%. Another customized CNN is used in [38], with an accuracy of 96.13%. A hybrid model of a CNN with NADE (neural autoregressive distribution estimation) achieves 95% accuracy. The proposed LCDEiT provides 98.11% on Figshare, which indicates its superiority over the other models. In TABLE 15, a pre-trained model called EfficientNetB0 is developed in [37] for the classification of the BraTS-21 dataset, with an accuracy of 55.90%. The accuracy improves drastically, by 33.20%, when YOLOv5 is utilized for classification. There is a further increase of around 2% in accuracy when a fine-tuned VGG19 network is used. A degradation of accuracy is observed when a machine learning algorithm, the support vector machine (SVM), is utilized. The model using a pre-trained CNN with correlation-based selection provides 92.67% accuracy. LCDEiT, with an accuracy of 93.69%, outperforms the existing models on the BraTS-21 dataset.
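Returning to the image-size study of TABLE 11, the short sketch below shows how the number of patches, and with it the size of one attention map, changes with image size when the patch size is fixed at 16×16. The intermediate image sizes, the memory size of 64, and the helper names are assumptions for illustration only. Counting attention-map entries is a rough proxy for cost, not a measurement of the model.

```python
def num_patches(image_size, patch_size=16):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    return (image_size // patch_size) ** 2


def attention_map_entries(n_patches, memory_size=None):
    """Entries in one attention map.

    Self-attention (memory_size=None) gives N * N entries;
    external attention with S memory slots gives N * S entries.
    """
    width = n_patches if memory_size is None else memory_size
    return n_patches * width


# Hypothetical image sizes between the (32x32) and (256x256) settings.
for size in (32, 64, 128, 256):
    n = num_patches(size)
    print(size, n, attention_map_entries(n), attention_map_entries(n, memory_size=64))
```

At (256×256) the image yields 256 patches, so one self-attention map has 65,536 entries, while an external-attention map with 64 memory slots has 16,384. Doubling the image side quadruples the patch count, which multiplies the self-attention map by sixteen but the external-attention map only by four.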

VII. CONCLUSION
This paper presents a teacher-student-based LCDEiT framework for categorizing tumors from brain MRIs. The framework consists of a gated-pooled CNN-based teacher model for knowledge extraction, followed by image classification with an external attention-based image transformer backbone. The knowledge taken from the teacher model compensates for the vision transformer's need for a vast dataset. The quadratic complexity caused by self-attention in the transformer encoder is eliminated by using external attention in the backbone transformer model, which makes the complexity linear w.r.t. the number of patches. The results show that the proposed framework with a transformer-based student model as its backbone achieves the best classification performance, with F1-scores of 0.978 and 0.937 for the Figshare and BraTS-21 datasets respectively. This reflects the strong applicability of image transformers with a robust learner in the medical imaging field, where faster computation is a crucial criterion for initiating treatment of critical patients. In the future, imbalanced-dataset handling approaches such as class-wise augmentation could be implemented to reduce the higher misclassification rate observed for classes with fewer samples. Although the proposed LCDEiT model outperformed existing models on the two distinct Figshare and BraTS-21 datasets, the experimental database could be enlarged further to improve the model's generality.