Brain Tumor Segmentation Using Partial Depthwise Separable Convolutions

Gliomas are the most common and aggressive form of all brain tumors, with median survival rates of less than two years for the highest grade. While accurate and reproducible segmentation of brain tumors is paramount for an effective treatment plan and diagnosis, automatic brain tumor segmentation is challenging because the lesion can appear anywhere in the brain with varying shapes and sizes from one patient to another. Moreover, segmentation is only done by analyzing pixel intensity values of surrounding tissues, and the diffusing nature of aggressive brain tumors makes it even more challenging to delineate tumor boundaries. Nevertheless, deep learning methods have shown superior performance in automatic brain tumor segmentation. However, their boost in performance comes at the cost of high computational complexity. This paper proposes an efficient network architecture for 3D brain tumor segmentation that partially utilizes depthwise separable convolutions to reduce computational costs. The experimental results on the BraTS 2020 dataset show that our methods achieve results comparable with state-of-the-art methods at minimal computational complexity. Furthermore, we provide a critical analysis of current efficient model designs. The code for this project is available at https://github.com/tmagadza/partialDepthwiseNet.


I. INTRODUCTION
Gliomas are the most common primary brain tumors in adults. Although their exact causes are still a mystery [1], risk factors include exposure to ionizing radiation and a family history of tumors. These tumors can appear anywhere in the brain with varying shapes and sizes, making them difficult to segment. The World Health Organization (WHO) has classified the tumors into four grades, from grade I to grade IV, depending on growth and aggressiveness. Low-grade gliomas (LGG), which constitute grades I and II, are less aggressive and have survival rates of several years, while high-grade gliomas (HGG) (grades III and IV) are much more aggressive and have median survival rates of less than two years even after treatment.
Magnetic Resonance Imaging (MRI) has emerged as the imaging technology of choice for brain tumor diagnosis and treatment planning [2]. Non-invasive MRI scans produce high-resolution 3D volumes of soft tissue. As depicted in Fig. 1, more than one MRI slice is used to view different tumor regions.
The associate editor coordinating the review of this manuscript and approving it for publication was Wai-keung Fung.
In clinical practice, highly trained radiologists segment brain tumors manually. Although manual segmentation arguably produces the most accurate results, it suffers from intra- and inter-rater variability [2], [3]. Moreover, it is tedious and time-consuming, and results depend on the radiologist's experience and knowledge. As a result, manual segmentation is mainly used for visual inspection and as a gold standard for semi-automatic and fully automatic segmentation.
Meanwhile, automatic segmentation methods require little to no human involvement. They have the benefits of being objective, reproducible, and well-suited for quantitative assessment of brain tumors. They have shown great potential in improving diagnosis and treatment planning.
Recently, deep learning methods, particularly Convolutional Neural Networks (CNNs) [4], [5], [6], [7], have been used to automatically analyze brain scans (usually MRI scans) due to their record-shattering performance. They require no feature engineering: they learn features directly from data. However, these methods have high memory and computational complexity. Furthermore, they require a large amount of training data to perform well, which is a challenge in medical imaging.
Currently, research efforts in fully automatic brain tumor segmentation are limited by the available computational budget [8]. Batch sizes and model complexities are restricted to what can fit into the available GPU memory. The use of 3D MRI volumes with large patch sizes in CNN models, which were empirically shown to outperform their 2D counterparts, makes it even more difficult, if not impossible, to train these models. Therefore, to improve the adoption rate of computer-assisted diagnosis in clinical setups, especially in developing countries, there is a need for more computation- and memory-efficient models. Fortunately, there has been an increase in research efforts to optimize the current state-of-the-art deep learning models for computer vision tasks [9], [10], [11], [12], [13], [14].
The contributions of this research work are: 1) We propose an efficient network architecture for 3D brain tumor segmentation that partially utilizes depthwise separable convolutions to reduce computational costs. 2) We quantitatively analyze the computational complexity of the proposed method and compare its segmentation performance with the state-of-the-art. 3) We provide a critical analysis of the latest methods that employ efficient model design. The rest of the paper is organized as follows: Section II reviews related work in efficient networks. Section III describes the proposed architecture for optimized 3D brain tumor segmentation. Section IV presents the experimental results, which are discussed in Section V. Lastly, Section VI provides concluding remarks.

II. LITERATURE REVIEW AND RELATED WORKS
Brain tumor segmentation is the process of classifying every pixel in a medical image as a normal or tumorous pixel. The process is done before and after treatment to determine the disease's progression and evaluate the effectiveness of the chosen treatment strategy [2]. It is very challenging to segment brain tumors accurately for several reasons: (1) segmentation is only achieved by the analysis of intensity variations between surrounding tissues [2], (2) brain tumors come in various shapes and sizes from one patient to another, and (3) aggressive brain tumors often diffuse into surrounding normal tissues, making it even more difficult to delineate tumor boundaries. Fig. 1 clearly shows that a single imaging modality is insufficient to delineate tumor boundaries accurately. When done manually, brain tumor segmentation is tedious and suffers from intra- and inter-rater variability [3]. Accurate and reproducible segmentation of brain tumors is critical for effective treatment planning, diagnosis, and monitoring of disease progression. In recent years, computer-assisted diagnosis has become mainstream in assisting medical practitioners in interpreting medical images [8], [15], [16]. While there are several methods for the automatic segmentation of brain tumors, deep learning methods are becoming widespread in the medical imaging domain [17] due to their strong performance. However, the boost in performance comes at the cost of high computational complexity, as we shall see later.
Among the deep learning family, the U-Net architecture [18] has emerged as the architecture of choice, primarily for the semantic segmentation of medical images. The architecture is composed of downsampling and upsampling paths. The downsampling path, which resembles a typical convolutional network, is used for feature extraction, while the upsampling path is used to recover the spatial resolution lost during feature extraction. The network depends heavily on data augmentation for better generalization. Since its inception in 2015, the architecture has inspired many research efforts in medical imaging. In [19], the U-Net was extended to take 3D volumes as input to fully exploit the volumetric data inherent in medical images. However, volumetric segmentation substantially increases the computational requirements. Kamnitsas et al. [20] proposed an ensemble of multiple heterogeneous models (including U-Net-based models) for robust semantic segmentation. Despite winning the BraTS 2017 challenge, their approach is highly inefficient, as each model has to be trained separately. In [21], Wang et al. exploited the hierarchical nature of brain tumor structures by proposing a cascade of U-Net models. Isensee et al. [22] incorporated context and localization modules for better segmentation performance. Myronenko [23], the winner of the BraTS 2018 challenge, used an autoencoder to regularize a shared decoder in a U-Net variant. His model suffered from high computational complexity due to the large patch size (160 × 192 × 128), standard convolutional operations, and the additional overhead of the autoencoder. Isensee et al. [7] showed that a U-Net architecture with minor alterations can achieve superior performance. However, large patch sizes (128 × 128 × 128) and standard convolutional operations result in high computational and memory requirements. Jiang et al. [6] proposed a cascaded U-Net that took advantage of the hierarchical nature of brain tumor substructures. Despite winning the BraTS 2019 Challenge, their model is still computationally expensive. Zhao et al. [4] exploited various heuristics in data processing, model design, and optimization to improve segmentation performance. Their work came second in the BraTS 2019 Challenge. Isensee et al. [24], the winner of the BraTS 2020 challenge, used the nnU-Net framework [25] with BraTS-specific modifications in post-processing, region-based training, and data augmentation, demonstrating the competitiveness of the U-Net model. Models that follow an encoder-decoder-like structure, as in the U-Net, have achieved state-of-the-art performance. However, most of these works focused mainly on improving segmentation performance at the expense of computational complexity. In this work, we introduce yet another U-Net model that builds on the works of Myronenko [23] and Ellis and Aizenberg [26] for more efficient volumetric segmentation.
To learn recent trends in efficient model design for brain tumor segmentation, we performed a Google Scholar search for works published from 2018 to 2022 that have "efficient" in their title or mention "FLOP" in their body. The search shows that of 1630 works on brain tumor segmentation, only 39 (2%) reported on the computational complexity of their methods. Surprisingly, of the 44 works with "efficient" in their title, only 8 (18%) reported on the computational efficiency of their models. These results indicate that the majority of works emphasize improving segmentation performance at the expense of computational cost.
In Table 1, we summarize the works that provided an analysis of the computational complexity of their methods, measured by the number of parameters, the number of floating-point operations (FLOPs), and the GPU memory requirements of a given model. From the table, most works for the period use 3D patches cropped from the 240 × 240 × 155 input to 128 × 128 × 128 voxels to fit in GPU memory. The batch size depends on the available GPU memory. Since a large patch size consumes much of the memory, the researcher has to trade off increasing the batch size against reducing the input patch size, which in turn hurts segmentation performance [23]. Another option is to maintain the large patch size and increase the number of GPUs. In reality, most researchers have a very tight computational budget. We have observed that several works [27], [28], [29], [30] exploited channel grouping to minimize the interaction between feature maps when performing convolutional operations, thereby reducing the number of parameters and FLOPs.
Our work is inspired by depthwise separable convolutions introduced by Sifre and Mallat in [31] and subsequently used to improve the efficiency and reduce the model size of 2D convolutional networks in [10] and [9]. Furthermore, we extensively use residual connections introduced by He et al. [32] to improve the flow of gradients in deep networks.

III. METHODS AND TECHNIQUES

A. STANDARD CONVOLUTION
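Consider the input feature maps $I \in \mathbb{R}^{h \times w \times d \times c}$, where $h$, $w$, $d$, and $c$ are the height, width, depth, and number of channels of the input feature maps, respectively, and the convolutional kernel $K \in \mathbb{R}^{k \times k \times k \times c \times n}$, where $k$ is the size of the convolutional kernel and $n$ is the number of output channels. The operation of a standard convolutional layer, $O = K * I$ with $O \in \mathbb{R}^{h \times w \times d \times n}$, is given by:
$$O(x, y, z, j) = \sum_{u, v, w, c} I(x + u,\ y + v,\ z + w,\ c)\, K(u, v, w, c, j),$$
where, inside the sum, $u$, $v$, and $w$ range over the $k \times k \times k$ kernel offsets (not to be confused with the width $w$), $c$ ranges over the input channels, and $j$ indexes the output channel.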
The computational complexity of a convolutional layer in terms of the number of multiplications is
$$k^{3}\, n\, c\, h\, w\, d.$$
The complexity of the standard convolution is cubic in the kernel size, which limits the kernel size of most CNNs in medical image analysis to 3 × 3 × 3.

B. DEPTHWISE SEPARABLE CONVOLUTION
The depthwise separable convolution splits the standard convolutional operation into depthwise and pointwise convolutions. First, it independently applies a spatial convolution to each input channel. It then performs a 1 × 1 convolution to combine the results. A standard convolution performs these operations in a single pass. Factorization of the convolutional operation has the benefit of improving efficiency and reducing the model size.
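Depthwise convolution with one filter per input channel can be expressed as
$$O_D(x, y, z, c) = \sum_{u, v, w} I(x + u,\ y + v,\ z + w,\ c)\, K_D(u, v, w, c),$$
where $u$, $v$, and $w$ range over the kernel offsets, and $K_D \in \mathbb{R}^{k \times k \times k \times c}$ is the depthwise convolutional kernel, in which the $c$-th filter in $K_D$ is applied to the $c$-th channel in $I$ to produce the $c$-th channel of the output feature map $O_D \in \mathbb{R}^{h \times w \times d \times c}$. The computational cost of the depthwise convolution is
$$k^{3}\, c\, h\, w\, d,$$
whereas a pointwise convolution can be expressed as
$$O(x, y, z, j) = \sum_{c} O_D(x, y, z, c)\, K_P(c, j),$$
where $K_P \in \mathbb{R}^{1 \times 1 \times 1 \times c \times n}$ is the pointwise convolutional kernel. The computational complexity of this operation is, therefore, $n\, c\, h\, w\, d$.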
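As a minimal illustration of the two steps described above, the following plain-Python sketch applies a depthwise convolution followed by a pointwise convolution to a tiny 3D volume. It is not the implementation used in this work: the function name `depthwise_separable_conv3d`, the channel-first nested-list layout, and the use of an unpadded ("valid") convolution are assumptions made only to keep the example short and dependency-free.

```python
from itertools import product


def depthwise_separable_conv3d(inp, k_depth, k_point):
    """Depthwise then pointwise convolution on a tiny 3D volume.

    Assumed (hypothetical) layouts, chosen for readability:
      inp[c][x][y][z]      input with c channels
      k_depth[c][u][v][w]  one k*k*k filter per input channel
      k_point[j][c]        1x1x1 kernel mixing c channels into n outputs
    No padding is applied, so each spatial size shrinks by k - 1.
    """
    channels = len(inp)
    k = len(k_depth[0])
    size_x = len(inp[0]) - k + 1
    size_y = len(inp[0][0]) - k + 1
    size_z = len(inp[0][0][0]) - k + 1
    offsets = list(product(range(k), repeat=3))

    # Depthwise step: channel c is filtered only by its own kernel.
    depthwise = [
        [[[sum(inp[c][x + u][y + v][z + w] * k_depth[c][u][v][w]
               for u, v, w in offsets)
           for z in range(size_z)]
          for y in range(size_y)]
         for x in range(size_x)]
        for c in range(channels)
    ]

    # Pointwise step: a weighted sum over channels at every voxel.
    return [
        [[[sum(k_point[j][c] * depthwise[c][x][y][z] for c in range(channels))
           for z in range(size_z)]
          for y in range(size_y)]
         for x in range(size_x)]
        for j in range(len(k_point))
    ]


# Two input channels of ones (3x3x3), all-ones depthwise filters, three outputs.
ones = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
volume = [ones, ones]
filters = [ones, ones]
mixing = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.0]]
result = depthwise_separable_conv3d(volume, filters, mixing)
```

In this toy case each depthwise response is 27 (the sum of a 3 × 3 × 3 block of ones), and the pointwise step only recombines those per-channel responses, which is why it needs one weight per input-output channel pair rather than a full k × k × k kernel.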
The combination of depthwise convolution and pointwise (1 × 1 × 1) convolution is called the depthwise separable convolution. The computational complexity of the depthwise separable convolution is
$$k^{3}\, c\, h\, w\, d + n\, c\, h\, w\, d,$$
which is a fraction $1/n + 1/k^{3}$ of the $k^{3}\, n\, c\, h\, w\, d$ multiplications needed by a standard convolution with the same kernel size and channel counts.

C. MODEL ARCHITECTURE
Our work follows a 3D U-Net [19] structure as shown in Fig. 3. The network is made up of five layers, with two ResNet-style [32] convolutional blocks in both the encoding and decoding paths. The encoding path takes in a random four-channel 3D MRI patch of size 128 × 128 × 128. Each layer along the encoding path halves the spatial resolution using strided convolution and doubles the number of channels, starting with a base width of 32 channels. As in [26], each residual block consists of two consecutive convolutional blocks, each performing group normalization followed by rectified linear unit activation and a 3 × 3 × 3 convolution (see Fig. 5a). Along the decoding path, each layer halves the number of feature maps before upscaling the spatial resolution using trilinear interpolation, and concatenates the result with gated high-resolution feature maps from the encoding path. In the last layer, the network uses a 1 × 1 × 1 convolution to reduce the number of feature maps to three, followed by a sigmoid activation function.
To improve the computational efficiency of the network, one can replace all the standard 3 × 3 × 3 convolutions in the residual modules with depthwise separable convolutions. However, empirical studies revealed that the group convolutions in the PyTorch¹ deep learning framework, which are used to model 3D depthwise separable convolutions, tend to use more GPU memory than standard convolutions. Therefore, to allow our network to fit in the available GPU memory, we only replaced the convolutions in the bottom three layers of the network with depthwise separable convolutions. Fig. 5b depicts the structure of the depthwise separable module.
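To make the potential saving concrete, the sketch below counts multiplications for a single 3 × 3 × 3 convolution at each level of the encoder, using the standard and depthwise separable cost expressions from Section III. Several details are assumptions rather than facts from this paper: the helper names are hypothetical, each convolution is taken to have equal input and output channel counts, the first level is assumed to run at the full 128-voxel resolution with 32 channels, and "bottom three layers" is read as the three deepest levels. The count covers arithmetic only and says nothing about GPU memory use.

```python
def standard_mults(k, c, n, h, w, d):
    """Multiplications of a standard k*k*k convolution: k^3 * n * c * h * w * d."""
    return k ** 3 * n * c * h * w * d


def separable_mults(k, c, n, h, w, d):
    """Depthwise (k^3 * c * h * w * d) plus pointwise (n * c * h * w * d)."""
    return k ** 3 * c * h * w * d + n * c * h * w * d


def level_shapes(base_width=32, patch=128, levels=5):
    """Assumed layout: channels double and resolution halves at each deeper level."""
    return [(base_width * 2 ** i, patch // 2 ** i) for i in range(levels)]


def cost_report(k=3, separable_levels=3):
    """Per-level cost of one convolution with equal input/output channels."""
    shapes = level_shapes()
    first_separable = len(shapes) - separable_levels
    report = []
    for index, (channels, side) in enumerate(shapes):
        standard = standard_mults(k, channels, channels, side, side, side)
        if index >= first_separable:
            # Assumed reading of "bottom three layers": the deepest levels.
            used = separable_mults(k, channels, channels, side, side, side)
        else:
            used = standard
        report.append((channels, side, standard, used))
    return report


for channels, side, standard, used in cost_report():
    print(f"{channels:4d} channels, {side:3d}^3 voxels: "
          f"standard={standard:.3e}, used={used:.3e}, "
          f"ratio={used / standard:.3f}")
```

Under these assumptions the ratio in the separable levels is 1/n + 1/27, so the relative saving grows with the channel count, which is largest in the deepest levels.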

D. ATTENTION MECHANISM
In deep learning, the attention mechanism forces the network to focus more on certain parts of the input while suppressing the rest. We adopted spatial attention [53] on the skip connections to enhance salient feature responses and suppress noisy ones before concatenating them with the feature responses from the decoding path. The module combines feature responses from the skip connections and the decoding path to learn gating weights and then applies them to the skip-connection feature responses. See Fig. 4 for the structure and operations performed by the spatial attention module.
¹ https://pytorch.org/

E. LOSS
We use the multi-class soft Dice loss $L_{dice} \in \mathbb{R}$, the mean loss across $c$ classes, where $y_{true} \in \mathbb{R}^{c \times n \times h \times w \times d}$ is the ground truth, $y_{pred} \in \mathbb{R}^{c \times n \times h \times w \times d}$ is the predicted segmentation maps, and $\epsilon$ is a small value to prevent division by zero.
VOLUME 10, 2022
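For readers who want a concrete reference point, the sketch below implements one common form of the soft Dice loss that matches the description above: a per-class term 1 − 2·Σ(y_true·y_pred) / (Σ y_true + Σ y_pred + ε), averaged over classes. The exact expression used in this work is not recoverable from the text, so this form, the function name `soft_dice_loss`, the default value of ε, and the flat per-class list layout are all assumptions.

```python
def soft_dice_loss(y_true, y_pred, eps=1e-5):
    """Mean soft Dice loss over classes.

    y_true, y_pred: one flat list of voxel values per class (assumed layout).
    Assumed per-class form: 1 - 2 * sum(t * p) / (sum(t) + sum(p) + eps).
    """
    losses = []
    for truth, pred in zip(y_true, y_pred):
        intersection = sum(t * p for t, p in zip(truth, pred))
        # eps keeps the denominator non-zero when a class is absent.
        denominator = sum(truth) + sum(pred) + eps
        losses.append(1.0 - 2.0 * intersection / denominator)
    return sum(losses) / len(losses)


# Two classes over four voxels: a perfect prediction and a disjoint one.
target = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
perfect = soft_dice_loss(target, target)
disjoint = soft_dice_loss(target, [[0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0]])
```

A perfect prediction drives the loss close to zero and a prediction with no overlap gives a loss of one. Other variants square the terms in the denominator or add ε to the numerator as well, which changes the value for classes that are absent from a patch.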

F. DATA AUGMENTATION
Data augmentation is an effective technique to enlarge the training dataset, thereby improving the model's generalization ability. In this paper, we apply data augmentation techniques that are relatively easy to implement and have low computational complexity. Specifically, we adopted the data augmentation scheme of Ellis and Aizenberg [26]. Random Gaussian noise and blurring were applied to input images with a 50% probability per training iteration. Input images were randomly scaled independently on each axis, with a standard deviation of 0.1 and a 50% probability per training iteration. Moreover, images were randomly flipped and translated independently along each axis.
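The scheme above can be read as a per-iteration draw of augmentation parameters. The sketch below samples such a draw with the standard library only. It is an interpretation, not the code of [26]: the function name `sample_augmentation`, the independence of the noise and blur decisions, the mean scale of 1.0, and the 50% flip probability per axis are assumptions, and translation is omitted because the text gives no parameters for it.

```python
import random


def sample_augmentation(rng, num_axes=3):
    """Draw one set of augmentation parameters for a training iteration."""
    params = {
        # Noise and blurring: 50% probability each (assumed independent).
        "add_noise": rng.random() < 0.5,
        "blur": rng.random() < 0.5,
        # Identity scaling unless the 50% scaling draw below succeeds.
        "scale": [1.0] * num_axes,
        # Assumption: each axis is flipped with 50% probability.
        "flip": [rng.random() < 0.5 for _ in range(num_axes)],
    }
    if rng.random() < 0.5:
        # Independent factor per axis, standard deviation 0.1 around an
        # assumed mean of 1.0.
        params["scale"] = [rng.gauss(1.0, 0.1) for _ in range(num_axes)]
    return params


rng = random.Random(0)  # seeded only to make the example repeatable
example = sample_augmentation(rng)
```

Separating the parameter draw from the image transform keeps the random decisions easy to log and test, and the same draw can then be applied to the image and its label map so that they stay aligned.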

A. DATA AND IMPLEMENTATION DETAILS
We used BraTS 2020 [2], [54], [55] [...] maps to the BraTS challenge online portal. All the scans in both the training and validation sets were co-registered to the same anatomical template, interpolated to the same resolution (1 mm³), and skull-stripped. Our network was implemented in PyTorch using an open-source deep learning framework [26]. We used the Adam optimizer with an initial learning rate of α = 1e-4, which was decreased by a factor of 0.5 every time the validation loss plateaued for 20 epochs, and a weight decay of 1e-3. The batch size was 2. We trained our network on an NVIDIA Tesla V100 16GB GPU. The code for this project is available at https://github.com/tmagadza/partialDepthwiseNet.
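The learning-rate rule described above can be sketched without any framework. The class below starts at 1e-4 and multiplies the rate by 0.5 once the validation loss has failed to improve for 20 consecutive epochs. The class name `PlateauSchedule` and the exact trigger condition are assumptions. PyTorch offers `torch.optim.lr_scheduler.ReduceLROnPlateau` for this purpose, but the text does not state whether it was used, and its patience counting differs slightly from this simplified version. The Adam update and the weight decay of 1e-3 are outside the scope of the sketch.

```python
class PlateauSchedule:
    """Halve the learning rate when the validation loss stops improving."""

    def __init__(self, lr=1e-4, factor=0.5, patience=20):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the updated rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Assumption: decay once `patience` epochs pass without improvement.
        if self.bad_epochs >= self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


schedule = PlateauSchedule()
rates = [schedule.step(1.0) for _ in range(21)]  # loss never improves again
```

Here the first epoch sets the best loss, the next nineteen leave the rate unchanged, and the twentieth epoch without improvement halves it.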

B. SIZE AND SPEED
In Table 2, we compare the size and speed of the baseline model and the proposed method. We used the network architecture proposed by Ellis and Aizenberg [26] as the baseline model. All the models were trained for 100 epochs. Our model outperforms the baseline model in all metrics. The proposed model decreases the model size and parameter count by roughly 90% and 70%, respectively. Moreover, it needed less time to complete 100 epochs of training. Removing the attention mechanism barely reduces the computational complexity of the proposed method.

C. ABLATION STUDY
We performed an ablation analysis to determine the performance contribution of each component of the proposed network. We trained each model for 100 epochs on the BraTS 2020 validation set while keeping all other network parameters constant. To improve segmentation performance on the enhancing tumor, we replaced all enhancing tumor voxels with necrosis if the total number of predicted enhancing tumor voxels was less than a threshold of 300 voxels. We refer to the stripped-down version of our proposed model as the baseline. To maintain consistency with previous works, we only report metrics computed by the online evaluation platform (https://ipp.cbica.upenn.edu/). Table 3 shows the Dice Similarity Coefficient results on the BraTS 2020 validation set. The performance of the baseline in all regions was quite strong. Adding depthwise separable modules improved the Dice scores marginally for the enhancing and whole tumor regions. We observed a larger gain when we trained the model with 5-fold cross-validation. Adding the attention mechanism decreased Dice scores for the enhancing tumor and tumor core regions. However, by reducing the input patch size to 96 × 128 × 80, we observed an interesting boost in Dice scores for the enhancing tumor. Applying L2 weight regularization to the proposed model resulted in good segmentation performance in all tumor regions. Moreover, performance increased when we created an ensemble of 10 models (5 single models + 5 models resulting from 5-fold cross-validation) aggregated by hierarchical majority vote.
Table 4 reports the performance of the proposed network as measured by the Hausdorff distance (95%) metric. Interestingly, our proposed model trained with small input patches outperformed all models, including the ensemble of 10 models, in all tumor regions. We observed a reduced Hausdorff distance in the tumor core region due to the attention mechanism and weight regularization. The model ensemble did not yield the expected benefits, except in the tumor core region.
Table 5 compares the Dice similarity scores of our models, trained for 100 epochs, against previous methods on the BraTS 2020 dataset. The online evaluation platform computed all metrics. No single model outperformed all methods in all metrics. Our model ensemble performed better than the method proposed by Wang et al. [5] overall and in both the whole tumor and the tumor core regions. Table 6 gives an aggregate summary of the performance of our methods in terms of the 95% Hausdorff distance (mm) against previous methods. Again, no single method outperformed all methods in all regions. An ensemble of 11 models by Yuan [57] achieved the best performance overall. Our single model trained with small input patches again performed well on this metric. Specifically, it outperformed the ensemble of 25 models by Isensee et al. [56] in both the enhancing tumor and tumor core regions. It also performed well in the tumor core region compared to the method by Wang et al. [58].
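The enhancing-tumor post-processing used in the ablation study (relabeling enhancing tumor as necrosis when fewer than 300 voxels are predicted) can be sketched as follows. The label values follow the BraTS convention, in which 4 marks enhancing tumor and 1 marks the necrotic and non-enhancing tumor core. The function name `suppress_small_enhancing` and the flat list layout are assumptions made for illustration, and a real pipeline would operate on 3D arrays instead.

```python
ENHANCING = 4  # BraTS label for enhancing tumor
NECROSIS = 1   # BraTS label for necrotic / non-enhancing tumor core


def suppress_small_enhancing(labels, threshold=300):
    """Relabel enhancing tumor as necrosis when it covers too few voxels.

    labels: flat list of per-voxel class labels for one case (assumed layout).
    Returns a new list; the input is left unchanged.
    """
    enhancing_count = sum(1 for label in labels if label == ENHANCING)
    if 0 < enhancing_count < threshold:
        return [NECROSIS if label == ENHANCING else label for label in labels]
    return list(labels)


# 299 predicted enhancing voxels fall below the threshold and are relabeled.
small_case = [ENHANCING] * 299 + [2] * 10 + [0] * 5
cleaned = suppress_small_enhancing(small_case)
```

The rule targets cases without a true enhancing region, where a handful of false-positive enhancing voxels would otherwise be heavily penalized by the Dice score for that class.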

V. DISCUSSIONS
Accurate and reproducible segmentation of brain tumors is paramount for an effective treatment plan and diagnosis. Deep learning methods have shown promising results compared to the inter-rater agreement. While several state-of-the-art automatic brain tumor segmentation methods exist in the literature, most focus on improving segmentation results at the cost of high computational complexity. Some works tried to incorporate techniques known to enhance network efficiency, like residual learning [32], in their design. We believe more emphasis should be placed on efficient model design as well. A competitive and lightweight model will result in cost savings in the long run. For example, the HPC Cluster⁵ we use to train the model imposes a 12-hour limit on each job. Moreover, every user falls under a Principal Investigator who applies for a CPU-hour resource allocation for their research programme. Thus, one would prefer the best accuracy under a limited computational budget. Table 2 clearly shows that our method needs less time to train and requires just 26 MB of disk space. Often the best-performing models are ensembles of multiple models, which results in more bandwidth utilization if the trained weights are to be moved to another location. For example, the nnU-Net model⁶ used by Isensee et al. [56] to win the BraTS 2020 Challenge comprises 25 models, which amount to 2 GB in compressed form. In real-life situations, the place where a model is trained is usually not where it will be deployed. For these reasons, we have proposed an efficient network incorporating depthwise separable modules to reduce the model size and the number of parameters while improving training and inference speed. Specifically, we replaced the convolution blocks of the bottom three layers of the U-Net structure with depthwise separable convolutions. We evaluated the performance of our network on the BraTS 2020 dataset. Results show that our model reduced the model size and the number of parameters by significant margins relative to the baseline model (as shown in Table 2).
As for the segmentation results, our model performed poorly in Dice scores for the enhancing tumor. This is a common problem [30] that may be caused by an intratumoral class imbalance, since LGG images do not have an enhancing region. One way of addressing the issue is to replace the enhancing tumor with necrosis if the predicted enhancing tumor class is smaller than a certain threshold [6]. In Table 4, we observed substantial improvement in the Hausdorff distance (95%) score in all tumor regions when we trained our proposed model with small patch sizes. Moreover, qualitative inspection of randomly selected predictions on the training set (see Fig. 7) reveals that our model sometimes gives highly accurate segmentations and, at other times, performs poorly. The use of model ensembles [2] is known to mitigate the problem.
⁵ https://www.chpc.ac.za/
⁶ https://zenodo.org/record/4003545#.Y1emJHZBzcc

VI. CONCLUSION
This paper proposes an efficient model for brain tumor segmentation using partial depthwise separable convolutions. Our proposed network replaces some convolutional blocks in a standard U-Net structure with depthwise separable blocks. The experimental results on the BraTS 2020 dataset show that our methods achieve results comparable with state-of-the-art methods at minimal computational complexity. Additionally, we have provided an extensive computational analysis of current methods. In the future, we will explore fusing multiple resolutions to capture long-range dependencies and improve segmentation performance.