A squeeze U-SegNet architecture based on residual convolution for brain MRI segmentation

This paper proposes an improved brain magnetic resonance imaging (MRI) segmentation model that integrates U-SegNet with fire modules and residual convolutions to segment brain tissues in MRI. In the proposed encoder-decoder method, the residual connections and the squeeze-expand convolutional layers of the fire module lead to a lighter and more efficient architecture for brain MRI segmentation. The residual unit helps in the smooth training of the deep architecture, and the features obtained from the residual convolutions exhibit a superior representation in the segmentation network. In addition, the method provides a more efficient architecture with fewer network parameters and better segmentation accuracy for brain MRI. The proposed architecture was evaluated on the publicly available Open Access Series of Imaging Studies (OASIS) and Internet Brain Segmentation Repository (IBSR) datasets for brain tissue segmentation. The experimental results showed superior performance compared to other state-of-the-art methods on brain MRI segmentation, with a Dice similarity coefficient (DSC) score of 0.96 and a Jaccard index (JI) of 0.92.


I. INTRODUCTION
Since magnetic resonance imaging (MRI) offers a high level of contrast and resolution, it has been widely employed in many clinical settings to study and examine the human brain [1]. In each of these tasks, it is important to automatically segment the brain tissues into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). However, it is challenging to segment these tissues accurately because the brain contains complex structures as well as heterogeneity in tissue composition caused by noise, bias field, and partial volume effects in MRI [2]. Further, most datasets do not provide labels for a large number of training samples [3]. In addition, labeling the dataset demands a domain expert, making the process expensive and time-consuming. To address these issues, strategies using deep learning, particularly convolutional neural networks (CNNs), are highly preferred for segmenting and classifying images because of their associated benefits: (i) application-specific activation functions can overcome training problems, (ii) various dropout techniques can aid in neural network regularization, and (iii) many optimization methods can be used for efficient training of CNN models [4][5]. Kong et al. [6] proposed the use of deep learning for automatic tissue segmentation, where MRI images were pre-processed with the wavelet multi-scale transformation, and then segmentation was performed using a CNN. Shakeri et al. [7] proposed a method for segmenting objects present in natural images using a fully convolutional neural network (FCNN). This FCNN-based method showed improved results by describing the output of the CNN as the potential of a Markov random field whose topology is similar to that of a volumetric grid. Dolz et al. [8] used a three-dimensional CNN architecture to segment subcortical brain MRI structures while managing computational complexity and memory requirements well during inference.
However, in many cases, research interest has been focused on the classification task of assigning single labels or probability values to the output. As a result, effective classification models were explored, and their performance was analyzed on large datasets such as ImageNet [9]. In contrast, semantic image segmentation models have been employed with small variations in their architectures. Some of the FCNN variant models, such as SegNet [10], U-net [11], U-SegNet [12], U-net++ [13], and CE-net [14] perform better than conventional segmentation methods.
Aside from segmentation accuracy, model size, inference time, and energy consumption are also important factors to be considered for many embedded applications [15]. Small CNN networks allow on-chip storage of the model, which consumes minimal energy while restoring parameters from memory during model training. In contrast, an off-chip memory consumes a hundred times more energy with more latency in computing operations. The number of parameters in a network layer is determined by the size of the kernel, the total number of filters, and the number of input channels to the filter. The fire modules used in SqueezeNet [16] include design strategies using 1×1 and 3×3 filters to decrease the number of learnable parameters. In addition, the dynamic range of the data can be lowered from 32 bits to 8 or 16 bits through quantization, thus reducing the model size. Inspired by SqueezeNet, we designed an efficient deep CNN architecture for brain MRI segmentation that can deliver improved accuracy with minimal system parameters. We propose an improved U-SegNet model by integrating U-SegNet with fire modules and residual convolutions, which leads to significantly reduced computation time while maintaining the segmentation accuracy for brain MRI segmentation. The stack of residual convolution blocks in the proposed method solves the degradation problem by connecting shallow layers to deep layers through short skip connections, enabling an end-to-end flow of information during the learning stage. Furthermore, fire modules are employed instead of convolution units in the encoder and decoder layers, resulting in fewer network parameters. Our contributions in this paper are summarized as follows.
- The proposed combination of the fire module and the residual connection is a novel extension of the existing U-SegNet model. The residual connections in the network enable the transfer of important information from the previous layer to the subsequent layers in a stable way.
- The convolution layers in both the encoder and decoder paths of the conventional U-SegNet are replaced with fire modules in our proposed method, which leads to a considerable reduction in parameters and computational complexity.
- Our proposed patch-wise residual-based squeeze U-SegNet model increases segmentation accuracy and achieves a substantial improvement over existing methods.
The rest of the paper is organized as follows. Section II discusses related works. The proposed approach and its architecture are discussed in detail in Section III. Section IV contains the experimental conditions, comparison studies, and extensive analysis of the proposed method. The conclusion is presented in Section V.

II. RELATED WORKS
Accurate segmentation of brain components such as GM, WM, and CSF in MRI is critical for performing a quantitative assessment of various brain tissues and for further studying the intracranial volume. Recently, CNNs have attained exceptional performance in the field of visual recognition and are widely applied owing to their robust, nonlinear feature extraction capabilities. FCNN-based semantic segmentation models, such as SegNet [10], U-net [11], U-SegNet [12], U-net++ [13], and CE-net [14], provide better accuracy for natural image segmentation. The architecture of SegNet [10] consists of encoder and decoder blocks. The key idea of the SegNet architecture is to up-sample the low-resolution feature maps into the original image space with the use of pooling indices. Because of the pooling indices, the model has faster convergence. However, SegNet may lose many fine details while downsampling the input image, thus resulting in lower segmentation accuracy [12]. Further, U-net [11] is used for a wide range of applications in medical image segmentation and has an architecture similar to that of SegNet, except for the usage of skip connections from the encoder layer to the corresponding decoder layer instead of pooling indices. Although U-net uses entire feature maps copied from the encoder layer to the corresponding decoder layer and produces better accuracy, the process requires significant memory owing to the copying of feature maps from the encoder to the decoder. Considering the strengths of the SegNet and U-net designs, a combined architecture referred to as U-SegNet [12] was introduced. Using SegNet as the base architecture, U-SegNet combines the best features from both U-net and SegNet. For improved performance, a skip connection is created between the encoder and decoder paths, and because the pooling indices are passed from the encoder to the decoder, faster convergence is achieved.
However, the pooling operations in these models can make the feature representations invariant to small changes, causing gradient problems. In U-net++, Zhou et al. [13] modified the skip connections into dense skip connections to allow flexible fusion of features in the decoder, providing better results than the scale-limiting skip connections of U-net [11], which only allow feature maps of the same scale to be merged. A small disadvantage of U-net++ is that the dense interconnections increase the number of parameters [17]. In addition, deep supervision is used to compensate for the decrease in segmentation accuracy caused by pruning. A deep residual model was proposed in [18] to solve the gradient problem. The residual convolutional method employs short skip connections in its architectural design and utilizes identity mapping to obtain a smooth training process. Our previous work [19] proposed an improved U-SegNet architecture by introducing a multi-scale guided input, with a multi-global attention module at the encoder and decoder paths of the conventional U-SegNet [12]. The multi-scale input features at each encoding layer combine the global and local contexts. In addition, the proposed global attention at the encoder and decoder can increase segmentation accuracy for MRI data by filtering out unnecessary information. Cheng et al. [14] used a pre-trained ResNet module to create a context encoder network (CE-net) in the feature encoder. CE-net combines newly designed dense atrous convolutions and multi-kernel residual pooling with a ResNet-modified U-net model to identify higher-level features and gain additional spatial information. Although these approaches capture targets at varied scales and measurements, the context dependencies are uniform for all image areas and are not adjustable. As a result, the methods fail to distinguish between localized and contextual interpretations for distinct categories.
The performance of CNN models is enhanced in combination with residual neural networks that are fine-tuned on large datasets [20]. In residual neural networks, each layer is fed into the next layer by utilizing skip connections, or shortcuts, that skip one or more network layers [20]. The outputs of these short connections are added to the prior stacked-layer output to generate an identity mapping. Fig. 1 shows a block diagram of the residual or shortcut connection. In deeper networks, some neurons can become inactive as the network grows deeper, leading to inefficient transfer of information. Hence, residual connections allow substantial details from prior layers to be transported to subsequent network layers without loss of information and prevent vanishing gradient problems [21]. In the early days of the multi-layer perceptron, the network was trained by adding a linear layer that connects the network input with its output [21]. Furthermore, to solve the vanishing gradient problem, some of the intermediate layers were coupled directly to secondary classifiers [22]. In [23][24], shortcut connections were used for the implementation of intermediate layer responses, gradients, and propagated errors. Hussain et al. [25] introduced the ORED-net architecture for segmenting eye areas into several classes. They employ 1×1 convolution-based non-identity residual connections from the encoder to the decoder layer to reduce the loss of information. Ibtehaz et al. [26] created MultiResUNet, an improved version of U-net in which the convolutional layers of U-net were replaced with Inception-like [27] blocks. According to the authors, this technique iteratively reuses spatial characteristics at different scales. He et al. [21] analyzed the theory behind the working principle of deep residual models. The derivations implied that identity shortcut connections are essential for the smooth propagation of information.
Highway networks [23] presented affine-transform-based shortcut connections, called gating functions, to ease the gradient-based training of deep networks. These gates depend on the data for better processing and require more parameters. According to recent studies [15], the vast majority of deep neural networks are over-parameterized, which leads to redundancy in deep learning networks, resulting in excessive use of memory and computation resources. For lighter models, several compression approaches, such as downsizing, factorization, or compression of pre-trained models, are applied to these large parameter spaces [28][29]. In the model compression approach, singular value decomposition is usually applied to a pre-trained neural network to obtain lower-order estimations of the parameters [30]. To build compressed CNN designs, network pruning approaches have been extensively researched, in which parameters of the pre-trained model below a certain threshold are replaced by zeros to produce sparse matrices. In early works [31][32], network pruning was considered a valid technique to decrease network complexity and over-fitting. Network quantization refers to a decrease in datatype size from 32 bits to 8 or 16 bits, which further compresses the pruned network by minimizing the number of data bits needed to describe a particular weight [33]. To function effectively with deep compression, Iandola et al. [16] suggested the SqueezeNet architecture to explore CNN models with fewer parameters while preserving the same performance. Conventional methods for medical image segmentation have certain limitations. In SegNet [10], neighbor-pixel information is likely to be lost when unpooling from low-resolution feature maps.
Skip connections in the U-net [11] and U-SegNet [12] architectures result in the merging of two arguably conflicting sets of characteristics, which may generate some inconsistency throughout the learning process and hence have a negative impact on prediction. Although the global attention module in [19] helps to filter out non-useful information, the use of dual attention at both the encoder and decoder would result in excessive filtering of the extracted data, increasing the risk of losing essential features. Therefore, the proposed method aims to gradually decrease the amount of discrepancy by moving towards shortcut connections. This is because residual connections in the proposed approach lead to more processing of the features at the encoder and also fuse them with decoder features to make more accurate predictions at the decoder output. The proposed method uses residual units to enable the learning of a deep hierarchical network. Furthermore, a convolution kernel captures information locally and neglects the correlation of features outside the receptive field [34]. The fire module in the proposed method can obtain a global view of the feature maps and identify the spatial representation with a reduced number of model parameters. Hence, the proposed network achieves more accurate results by incorporating residual connections even with fewer parameters.

III. PROPOSED METHODOLOGY
Although U-SegNet [12] shows faster convergence and improved segmentation accuracy, blurry and smooth outputs are observed in the up-sampled results, and the network is also insensitive to fine image details. Moreover, brain tissue segmentation is a complex task and requires more precision than natural image segmentation. The network needs to maximally extract features to improve brain tissue segmentation while training on a small amount of data. However, the conventional U-SegNet model has difficulty capturing better features because it performs a large number of pooling operations that generate low-resolution segmentation maps for brain MRI. Moreover, the multiple attention modules at the encoder and decoder paths of the improved U-SegNet [19] are likely to lose important features and result in model overfitting. Many attention modules also add a computational burden to model training.
To resolve these limitations, we propose a novel network design in which the U-SegNet is integrated with residual connections, allowing memory (or information) to flow from the earlier to the last layers. In addition, we utilize fire modules that include a squeeze layer with 1×1 convolution filters, followed by an expand layer with both 1×1 and 3×3 filters, to limit the number of learnable parameters, resulting in a smaller and more efficient model. Moreover, a model trained on the entire image is more likely to lose local features. Hence, to achieve effective feature extraction, each input slice is divided into non-overlapping uniform patches to train the proposed architecture [35]. As a result, when patch-wise inputs are combined with residual connections and fire modules, the proposed method can achieve improved segmentation accuracy while reducing network complexity. Fig. 2 shows the overall outline of the proposed approach. First, we collected the MRI scans with their corresponding ground truths. In general, each scan has dimensions of height × width × slices (H × W × S). Each slice is resized to 256×256 by zero-padding the H × W plane. Then, starting with the 10th slice, 48 slices were selected at three-slice intervals. Every slice was then subdivided into four uniform non-overlapping patches, which were used for training the proposed model. Finally, the model is given test inputs after training, and the predicted output segmentation maps are generated. The proposed architecture consists of the following components: (i) encoder path, (ii) decoder path, (iii) fire module, and (iv) classification layer. Fig. 3 shows the schematic representation of the proposed architecture, which contains the encoder and decoder paths.
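The patch subdivision described above can be illustrated with a minimal NumPy sketch; the function name `extract_patches` is illustrative and not from the paper.

```python
import numpy as np

def extract_patches(slice_2d, patch_size=128):
    """Split a 2-D slice into non-overlapping square patches.

    For a 256x256 slice and patch_size=128, this yields the four
    uniform patches used to train the proposed model.
    """
    h, w = slice_2d.shape
    patches = []
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            patches.append(slice_2d[i:i + patch_size, j:j + patch_size])
    return patches

slice_2d = np.zeros((256, 256), dtype=np.float32)
patches = extract_patches(slice_2d)
print(len(patches), patches[0].shape)  # 4 (128, 128)
```

Because the patches are non-overlapping and uniform, the predicted patch maps can simply be tiled back together to reconstruct the full-slice segmentation.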

A. Encoder path
The key concept in our proposed model is the use of a fire module to reduce the number of learnable model parameters and the computational burden of the model. We integrated the fire module [16] into our proposed residual U-SegNet for segmenting MRI images of the brain. The encoder path contains a series of fire modules, and each of its outputs is concatenated with its input such that residual connections can be made. Considering the input sample $x_l$ at the $l$-th network layer of the residual CNN (RCNN), the convolution output of the squeeze block is given by (1):

$$s_l = \phi\big(W^{l}_{[1\times1]} * x_l + b_l\big) \qquad (1)$$

where $s_l$ is the output from the squeeze layer of the fire module and $W^{l}_{[1\times1]}$ denotes the weights of the convolution kernels, with the subscript $[1\times1]$ representing the kernel size for the respective network layer. The squeeze layer in the proposed method is designed with $C/4$ channels, where $C$ is the number of input channels used in the conventional U-net [11] architecture. $b_l$ is the bias term, and $*$ represents the convolution operation. Each convolution operation is followed by the standard ReLU activation function $\phi(\cdot)$.
Similarly, the output from the squeeze layer is fed in parallel to the 1×1 and 3×3 kernels of the expand layer, each with $C/2$ channels, and their results are concatenated to produce the fire module output, expressed as (2):

$$f_l = \mathrm{Concat}\big(\phi(W^{l}_{[1\times1]} * s_l + b_l),\ \phi(W^{l}_{[3\times3]} * s_l + b_l)\big) \qquad (2)$$
where $f_l$ is the output from the fire module for the $l$-th layer of the network and $\mathrm{Concat}(\cdot)$ is a concatenation function. The fire module output is added to the input to form the residual connection. The output of the RCNN block can be calculated using (3):

$$r_l = f_l + x_l \qquad (3)$$

where $+$ is the element-wise addition representing the residual mapping to be learned, and $x_l$ is the input sample of the RCNN block. The sample $r_l$ is used as the input for the subsequent max-pooling or unpooling layers of the encoder and decoder convolutional units, respectively.
The output of the residual block is forwarded to the max-pooling layer to reduce the dimensions of the feature maps and emphasize intricate details of the feature space as (4):

$$p_l = \mathrm{maxpool}_{2\times2}(r_l) \qquad (4)$$

The max-pooled residual block is referred to as the encoder unit. The pooling indices are stored during max-pooling and used to unpool the feature maps in the decoder.
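Equations (1)-(3) can be illustrated with a minimal NumPy sketch. The naive convolution loop below is for clarity only (the paper's implementation uses Keras), and all function names, the toy shapes, and the channel setting C = 8 are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv2d(x, w, b):
    """Naive 'same'-padded convolution, stride 1. x: (H, W, Cin), w: (k, k, Cin, Cout)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    H, W = x.shape[:2]
    out = np.zeros((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.tensordot(xp[i:i + k, j:j + k], w, axes=3) + b
    return out

def fire_residual(x, params):
    """Eqs. (1)-(3): squeeze (1x1, C/4), expand (1x1 and 3x3, C/2 each), add input."""
    ws, bs, w1, b1, w3, b3 = params
    s = relu(conv2d(x, ws, bs))            # Eq. (1): squeeze layer
    e1 = relu(conv2d(s, w1, b1))           # expand branch, 1x1 kernel
    e3 = relu(conv2d(s, w3, b3))           # expand branch, 3x3 kernel
    f = np.concatenate([e1, e3], axis=-1)  # Eq. (2): fire output, C channels again
    return f + x                           # Eq. (3): identity residual connection

rng = np.random.default_rng(0)
C = 8
x = rng.standard_normal((8, 8, C))
params = (
    rng.standard_normal((1, 1, C, C // 4)), np.zeros(C // 4),       # squeeze weights
    rng.standard_normal((1, 1, C // 4, C // 2)), np.zeros(C // 2),  # expand 1x1
    rng.standard_normal((3, 3, C // 4, C // 2)), np.zeros(C // 2),  # expand 3x3
)
y = fire_residual(x, params)
print(y.shape)  # (8, 8, 8)
```

Note that the two expand branches together restore the original channel count C, which is what makes the element-wise addition in Eq. (3) possible without a projection.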

B. Decoder path
The architectural design of the decoder is similar to that of the encoder, in which max-pooling is replaced with a max-unpooling operation to recover the original input resolution. Fig. 3 shows the overall proposed architecture with encoder-decoder paths. The decoder in the proposed method uses fire modules to reduce network parameters. The decoder path also employs fire modules with 1×1 convolutions, taking $C/4$ channels at the squeeze layer. The outputs from the squeeze layer are computed in parallel with 3×3 and 1×1 convolution kernels, where each kernel accepts $C/2$ channels. Further, the outputs across these parallel convolutions are combined by concatenation to generate the fire module output at the decoder layer. Similar to the encoder, the output from the decoder fire module is concatenated with its input to form a residual block. The residual output is unpooled with reference to the stored pooling indices, which are used to localize the feature maps while unpooling. Furthermore, the feature maps from the encoder layer containing the contextual information are concatenated with the respective decoder-layer feature maps to form skip connections as (5):

$$d_l = \mathrm{Concat}\big(\mathrm{unpool}(r_l, I_l),\ e_l\big) \qquad (5)$$

where $I_l$ are the pooling indices transferred from the encoder to the decoder layer to retrieve the spatial information of the feature maps at the decoder, and $e_l$ denotes the encoder feature maps passed over the skip connection. These skip connections include both higher- and lower-resolution feature information and concentrate on the most useful details for the up-sampling operations.
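The index-preserving pooling and unpooling described above can be sketched in NumPy for a single 2-D feature map; the 2×2 window, the function names, and the toy input are illustrative assumptions.

```python
import numpy as np

def maxpool_with_indices(x):
    """2x2 max-pooling over a 2-D map, recording flat argmax positions (pooling indices)."""
    H, W = x.shape
    pooled = np.zeros((H // 2, W // 2))
    indices = np.zeros((H // 2, W // 2), dtype=np.int64)
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            win = x[i:i + 2, j:j + 2]
            di, dj = np.unravel_index(np.argmax(win), (2, 2))
            pooled[i // 2, j // 2] = win[di, dj]
            indices[i // 2, j // 2] = (i + di) * W + (j + dj)
    return pooled, indices

def max_unpool(pooled, indices, shape):
    """Place each pooled value back at its recorded position; other entries stay zero."""
    out = np.zeros(shape).ravel()
    out[indices.ravel()] = pooled.ravel()
    return out.reshape(shape)

x = np.array([[1., 5., 2., 0.],
              [3., 4., 1., 7.],
              [0., 2., 9., 3.],
              [8., 6., 4., 1.]])
pooled, idx = maxpool_with_indices(x)
restored = max_unpool(pooled, idx, x.shape)
print(pooled)  # [[5. 7.] [8. 9.]]
```

Because only the argmax positions are stored (not the full feature maps), this SegNet-style unpooling recovers spatial localization with far less memory than copying encoder features, which is exactly the trade-off discussed for SegNet versus U-net.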

C. Fire module
In our proposed architecture, the fire module was adopted to decrease the network parameters in the segmentation of brain MRI. It was originally used in [16] to reduce the number of parameters of AlexNet [36] while maintaining acceptable classification accuracy. In the fire module of SqueezeNet [16], a 1×1 convolution filter in the squeeze layer is used to decrease the number of channels of the input elements. In the expand layer, 1×1 and 3×3 convolution filters are used to extract multi-scale information from the input image. Motivated by the benefits of the fire module, the proposed encoder-decoder architecture is integrated with the fire module to reduce the model parameters. First, most of the conventional methods [10][11][12][13][14] use a 3×3 convolution filter, whereas, in the proposed network, the 3×3 convolution filter is replaced by a squeeze layer with a 1×1 filter feeding into the expand layer, which is a combination of 1×1 and 3×3 convolution filters, as presented in Fig. 4. The use of a 1×1 filter in the squeeze layer results in nine times fewer parameters than the 3×3 filter used by the conventional methods. Fig. 4-(a) shows the normal convolution layer at the encoder and decoder sides of a conventional U-net [11], with a 3×3 convolutional kernel containing C input filters, which accepts an input feature map of dimensions height × width × channels (H × W × C). Fig. 4-(b) shows the structure of the fire modules in the encoder and decoder layers of the proposed method. Second, a smaller number of filters in the squeeze layer feeding into the expand layer reduces the learnable parameters owing to the reduced number of connections in the network. As shown in Fig. 4-(b), the squeeze module consists of only a 1×1 convolution layer with C/4 output channels.
The output from the squeeze layer is fed into the expand unit, which contains two parallel convolutions with kernel sizes of 3×3 and 1×1, each with C/2 output channels. Furthermore, the outputs of these parallel convolutions are concatenated to generate the output of the fire module. Hence, the proposed method uses a squeeze layer with fewer filters than the total number of filters in the expand layers, resulting in a considerable reduction in the overall network parameters. Finally, in several conventional methods, some layers in the network have larger strides (>1), so most network layers have small activation maps. In our work, we reduced the stride of the convolution layers to one, which creates larger feature maps in the network, thus resulting in higher segmentation accuracy.
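The parameter saving from replacing a standard 3×3 layer with a fire module can be checked with a short counting sketch. The setting C = 64 for both input and output channels, and the inclusion of bias terms, are illustrative assumptions rather than figures from the paper.

```python
def conv_params(k, cin, cout):
    """Learnable parameters of a k x k convolution: weights plus one bias per filter."""
    return k * k * cin * cout + cout

def fire_params(c):
    """Fire module: squeeze (1x1, c/4 filters), expand (1x1 and 3x3, c/2 filters each)."""
    squeeze = conv_params(1, c, c // 4)
    expand1 = conv_params(1, c // 4, c // 2)
    expand3 = conv_params(3, c // 4, c // 2)
    return squeeze + expand1 + expand3

c = 64
print(conv_params(3, c, c))  # 36928 parameters for a plain 3x3 layer
print(fire_params(c))        # 6224 parameters for the fire module
```

In this setting the fire module uses roughly six times fewer parameters than the plain 3×3 layer; the nine-fold figure in the text refers to the per-kernel saving of a 1×1 filter over a 3×3 filter.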

D. Classification layer
In the final decoder layer, a 1×1 convolutional layer is combined with a softmax activation to predict the output segmentation maps. There are four predicted classes: GM, WM, CSF, and background. For a given input image, the proposed model creates a learned representation, and the input image is categorized into one of the four output classes using this feature representation. The softmax layer receives the decoder representations $z$ and interprets them into the output class by assigning a probability score $y'_c$. The output of the softmax layer for $c$ classes can be obtained using (6):

$$y'_c = \frac{e^{z_c}}{\sum_{j=1}^{C} e^{z_j}} \qquad (6)$$

and the network cost function is calculated using the cross-entropy loss function as in (7):

$$L = -\sum_{i} y_i \log y'_i \qquad (7)$$

where $y_i$ and $y'_i$ are the ground truth and the predicted distribution score obtained for each class $i$, respectively.
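Equations (6) and (7) translate directly into NumPy. The max-shift in the softmax and the small epsilon inside the logarithm are standard numerical-stability additions, and the example logits are illustrative.

```python
import numpy as np

def softmax(z):
    """Eq. (6): per-pixel class probabilities from logits (last axis = classes)."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Eq. (7): mean cross-entropy between one-hot targets and predicted probabilities."""
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=-1))

# One pixel with 4 class logits (e.g., GM, WM, CSF, background).
logits = np.array([[2.0, 1.0, 0.5, 0.1]])
probs = softmax(logits)
y = np.array([[1.0, 0.0, 0.0, 0.0]])  # ground truth: first class
print(probs.sum())  # 1.0
```

In practice the loss is averaged over every pixel of every patch in the mini-batch, which is what the `np.mean` over the leading axis models here.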

IV. EXPERIMENTAL RESULTS AND ANALYSIS
Two sets of MRI brain images were used to evaluate the proposed method. The first study used 416 T1-weighted brain MRIs gathered from the Open Access Series of Imaging Studies (OASIS) database [37] at Washington University, which contains information about both non-demented and demented subjects. Out of the 416 subjects, 150 were selected for our study. Of the selected data, the first 120 subjects were used to train the model, and the remaining 30 subjects were used as test datasets. We trained and tested our proposed network using MRI slices in three planes: axial, sagittal, and coronal. The input axial scans in OASIS had dimensions of 208×176×176, with each scan containing 176 slices. It has been discovered that the distinguishable tissues of the volume are often found near the middle slices [34], and largely similar information is repeated across successive slices. Thus, in order to eliminate non-informative slices and decrease the repetition across successive training slices, a set of 48 slices was chosen for evaluation, starting from the 10th slice with a three-slice interval between slices. In order to resize the extracted slices to dimensions of 256×256×48, 24 pixels of zeros were inserted at the top and bottom and 40 pixels of zeros at the left and right edges of each image. Similarly, the sagittal and coronal planes of the MRI slices were also resized to 256×256. As a result, each input scan was made up of 48 slices of 256×256 dimensions. Each slice of the MRI scan and its associated ground truth segmentation map was divided into uniform patches during training: each 256×256 slice was divided into four patches, so each partitioned patch fed to the proposed model had dimensions of 128×128. The trained model was applied to these patches, and the test data were used to predict segmentation results. We also used MRI images from the Internet Brain Segmentation Repository (IBSR) database [38].
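The OASIS slice selection and zero-padding steps above can be sketched as follows; a minimal NumPy sketch assuming the volume is laid out as (height, width, slices), with illustrative function names.

```python
import numpy as np

def select_slices(volume, start=10, step=3, count=48):
    """Pick `count` slices beginning at `start` with a fixed interval between them."""
    idx = [start + i * step for i in range(count)]
    return volume[:, :, idx], idx

def pad_to_256(slice_2d):
    """Zero-pad a 208x176 axial slice to 256x256 (24 px top/bottom, 40 px left/right)."""
    return np.pad(slice_2d, ((24, 24), (40, 40)))

vol = np.zeros((208, 176, 176))       # one OASIS axial scan
slices, idx = select_slices(vol)
print(len(idx), idx[0], idx[-1])      # 48 slices, indices 10 through 151
print(pad_to_256(vol[:, :, 0]).shape) # (256, 256)
```

Note that the last selected index, 151, stays well within the 176 available slices, so the three-slice interval never runs off the end of the volume.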
A total of 18 T1-weighted MRI scans were collected from 14 healthy men and four healthy women between the ages of 7 and 71. The datasets in the IBSR were pre-processed with skull stripping, normalization, and bias field correction [38]. The training dataset includes 12 subjects with manual annotations and ground truth labels, while the remaining six were used to test the model. The experimental settings are summarized in Table I. The proposed network was trained and tested on an NVIDIA GeForce RTX 3090 GPU, and stochastic gradient descent was employed to minimize the loss function. We used a learning rate of 0.001, a high momentum rate of 0.99, and a total of 10 epochs for training. The proposed work was implemented using the Keras framework. Fig. 5 and Fig. 6 show the segmentation results for the OASIS and IBSR datasets in three planes: axial, coronal, and sagittal, respectively. From the figures, the proposed method achieves well-segmented results for GM, WM, and CSF of brain MRI on both datasets. It is noted that the central slices of the axial plane of a brain MRI contain more information than the other two planes; thus, segmentation in the axial plane achieves the most effective performance. The highlighted regions in Fig. 5 and Fig. 6 demonstrate that the results on the sagittal and coronal planes also show accurate predictions. From the findings of Fig. 5 and Fig. 6, it can be inferred that the proposed approach can effectively extract detailed feature patterns for all three planes. Quantitative metrics were used to evaluate the efficacy of the proposed architectural design.
DSC [39] and JI [40] are typical metrics for comparing ground truth maps with output segmented maps. As shown in Table II, DSC is defined as twice the number of common elements shared by both sets divided by the sum of the number of elements in each set, where |X| and |Y| represent the cardinalities of the ground truth and predicted segmentation sets, respectively (i.e., the number of elements in each set). JI can be described in terms of the DSC, as shown in Table II. Both the DSC and JI measures are used to determine the fit between the predicted segmentation map and the related ground truth segmentation map.
Segmentation performance was also evaluated by the mean square error (MSE), which refers to the mean squared difference between the original X and the predicted Y. The Hausdorff distance (HD) [41] was used to calculate the dissimilarity of two sets in metric space. Here, the HD is computed between the boundaries of the ground-truth and predicted segmentation maps, where a smaller HD value shows that the predicted output is nearly identical to the ground truth, indicating lower segmentation error. The metric formulations are summarized in Table II, where D is the Euclidean distance between two pixels, and R and C are the image height and width, respectively. To evaluate the comparative segmentation results, we performed experiments for the SegNet [10], U-net [11], U-SegNet [12], U-net++ [13], CE-net [14], ORED-net [25], and MultiResUNet [26] models under the same experimental conditions. In Figs. 7 and 8, in which comparisons of the segmentation results are provided, the proposed method showed better-quality segmentation maps compared to the other conventional methods.
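The DSC, JI, and MSE metrics described above can be written as a few NumPy one-liners for binary masks; the toy ground truth and prediction are illustrative.

```python
import numpy as np

def dice(x, y):
    """DSC = 2|X ∩ Y| / (|X| + |Y|) for binary masks."""
    inter = np.logical_and(x, y).sum()
    return 2.0 * inter / (x.sum() + y.sum())

def jaccard(x, y):
    """JI = |X ∩ Y| / |X ∪ Y|; equivalently DSC / (2 - DSC)."""
    inter = np.logical_and(x, y).sum()
    return inter / np.logical_or(x, y).sum()

def mse(x, y):
    """Mean squared difference between original and predicted maps."""
    return np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2)

gt = np.array([[1, 1, 0], [0, 1, 0]])    # toy ground truth mask
pred = np.array([[1, 0, 0], [0, 1, 0]])  # toy prediction
print(dice(gt, pred))     # 0.8
print(jaccard(gt, pred))  # 0.666...
```

For the four-class brain maps, these scores are computed per tissue (GM, WM, CSF) by binarizing each class label before applying the formulas, which matches the per-tissue columns reported in Table III.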
In [35], it was reported that the segmentation accuracy of U-net, SegNet, and ORED-net significantly decreased for complex image textures. As can be seen in Figs. 7 and 8, fine details are missing in the U-net, SegNet, and ORED-net results compared to those of the proposed method. Because the network layers in U-net++ are connected by a series of nested, dense skip paths, leading to redundant feature learning, it could not perform well.
In particular, misclassification can be observed in the feature maps of the U-net-based variant models, such as U-net, U-net++, and MultiResUNet [26], in Figs. 7-(c) and 8-(c), whose highlighted areas concentrate on specific tissues. Although U-SegNet with a combination of indices and skip connections yielded better segmentation results, it failed to capture fine details. As shown by the red outlined boxes in Fig. 8, U-SegNet failed to identify differences between WM and GM tissues, and most of the GM tissues were incorrectly predicted as WM. For the segmentation of medical images, CE-net uses a context encoder block to retrieve multi-scale information. However, because the context encoder module is present only at the bottleneck stage of the model, the multi-scale feature details may become diluted by the time they arrive at the last decoder layer for the classification task. The proposed method aims to overcome these difficulties and employs residual connections to successfully propagate useful information through the network layers and ensure the capture of fine details. In addition, uniform patches further improve the performance of the proposed method by directing its focus to relevant areas and capturing better feature representations. The improved segmentation result for the proposed method is shown in Fig. 7. Similar findings were seen for the IBSR dataset in Fig. 8. It can be inferred from the visual results of Figs. 7 and 8 that our proposed method can effectively retrieve both coarse and local segmentation features while avoiding distractions across tissue boundary regions. The quantitative analysis of the proposed method was performed by comparison with the existing SegNet [10], U-net [11], U-SegNet [12], U-net++ [13], CE-net [14], ORED-net [25], and MultiResUNet [26] methods; the results are presented in Table III.
Our proposed network achieved an average improvement of 10, 3, 2, 2, 1, 3, and 2 percentage points in DSC over the SegNet, U-net, U-SegNet, U-net++, CE-net, ORED-net, and MultiResUNet methods, respectively, and attained an MSE of 0.004, lower than that of the conventional methods. The lower performance of SegNet [10] can be attributed to the fact that it stores the max-pooling indices, that is, the positions of the maximum feature value in every pooling window for each encoder map, and reuses them to up-sample low-level feature maps; thus, translation invariance is often compromised. In contrast, U-net employs skip connections as the backbone of the architecture, blending deep, coarse data with shallow, fine semantic information [11]. A minor drawback of U-net is the significant memory required to store lower-level features during up-sampling for later concatenation [5]. The atrous convolutions coupled with max-pooling at multiple kernel sizes in CE-net help to capture multi-scale data without redundancy. However, CE-net's ability to extract multi-scale features is confined to the bottleneck layer, resulting in a very weak feature representation at the final decoder layer. MultiResUNet reuses spatial features across different scales for multi-resolution analysis and obtains a DSC of 94%. Further, ORED-net employs non-identity residual-based skip pathways to reduce information loss through the network layers, resulting in a DSC of 92% with an MSE of 0.005. However, MultiResUNet shows limited generality and hence tends to over-segment and make false predictions, and the simple 1×1 convolution-based non-identity residual in ORED-net is not efficient enough to prevent information loss between the encoder and decoder layers.
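For reference, the DSC and JI scores compared above can be computed from binary segmentation masks as follows. This is a minimal NumPy sketch, not the authors' evaluation code; the function and array names are illustrative:

```python
import numpy as np

def dice_score(pred, gt):
    """Dice similarity coefficient: 2|A∩B| / (|A| + |B|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum())

def jaccard_index(pred, gt):
    """Jaccard index: |A∩B| / |A∪B|."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union

# toy example: two overlapping 2x2 masks
pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
# DSC = 2*1/(2+1) ≈ 0.667, JI = 1/2 = 0.5
```

In a multi-class setting such as WM/GM/CSF segmentation, these scores are computed per tissue class and then averaged, which matches the per-tissue columns of Table III.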
Furthermore, owing to the pooling layers in the encoder stage, the segmentation maps produced by these existing networks are of low resolution. As a result, the pooling layers must be eliminated in order to maintain a high spatial resolution. However, without pooling layers, the SegNet, U-net, U-SegNet, U-net++, CE-net, ORED-net, and MultiResUNet models might not learn holistic features from images, since convolution is a relatively local operation. Our proposed method, a residual-based network combined with a fire module, can be a potential solution to the aforementioned problems and produces improved segmentation accuracy. In the fire module, the input is convolved with 1×1 and 3×3 kernels whose outputs are concatenated, which captures the global context without reducing the resolution of the segmentation map. This rich information from the fire module, combined with its input through residual connections, results in enhanced feature representation. As a result, global information can be shared across layers without sacrificing resolution, and blurring in the segmentation maps is reduced.

Fig. 7. Qualitative comparison for GM, CSF, and WM using the proposed method and existing methods for the OASIS dataset. From left to right: original input image, ground truth, SegNet, U-net, U-SegNet, U-net++, CE-net, ORED-net, MultiResUNet, and proposed method, respectively.

Fig. 8. Qualitative comparison for GM, CSF, and WM using the proposed method and existing methods for the IBSR dataset. From left to right: original input image, ground truth, SegNet, U-net, U-SegNet, U-net++, CE-net, ORED-net, MultiResUNet, and proposed method, respectively.

TABLE III. The result of the brain MRI segmentation for the proposed method compared with the conventional methods on the OASIS and IBSR datasets (per-tissue DSC, JI, and HD for WM, GM, and CSF; table body not recoverable from the extracted text).
Thus, the selective integration of spatial information through uniform patches and feature maps followed by residual connections helps efficiently capture context information. Furthermore, to demonstrate that the network can have very few learnable parameters while maintaining the same accuracy, we proposed the use of a fire module. Fig. 9 illustrates how the learnable parameters and computation time were calculated. Smaller models can be built by stacking a series of fire modules, each of which contains a squeeze layer with only 1×1 convolution filters. The 1×1 filters in the squeeze layer downsample the input channels and reduce the parameters before sending them to the expand layer. The 1×1 filters in the expand layer combine channels and perform cross-channel pooling, while the 3×3 convolution filters in the expand layer capture the spatial representation. As a result, by combining these two filter sizes while operating on fewer parameters, the model becomes more descriptive. The fire module lowers the computational burden by reducing the parameter map, yielding a smaller CNN with improved accuracy. Although the proposed method includes more layers than the conventional methods, it has 0.5 million fewer parameters than U-net and requires an identical number of parameters to the U-SegNet model. The total number of parameters in our proposed method is 4 million, which is seven and two times smaller than that of the CE-net and ORED-net architectures, respectively. The proposed method takes 1.5 hours to train, almost the same training time as the U-SegNet and ORED-net models; it is 68% faster than CE-net and provides improved accuracy.
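The fire-module-with-residual building block described above can be sketched as follows. This is an illustrative PyTorch reconstruction under stated assumptions, not the authors' implementation: the channel sizes are arbitrary, and the residual combination is realized here as addition through a 1×1 projection (the paper's ablation also mentions a concatenation variant):

```python
import torch
import torch.nn as nn

class ResidualFireModule(nn.Module):
    """Fire module (squeeze 1x1 -> parallel expand 1x1 and 3x3,
    concatenated) whose output is combined with the block input via
    a residual connection, preserving spatial resolution throughout."""

    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        # squeeze layer: 1x1 convolutions reduce channel dimensionality
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        # expand layer: parallel 1x1 (cross-channel) and 3x3 (spatial) paths
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch,
                                   kernel_size=3, padding=1)
        # 1x1 projection so the residual addition matches channel counts
        self.project = nn.Conv2d(in_ch, 2 * expand_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        out = torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)
        # residual connection: add the (projected) input back
        return self.relu(out + self.project(x))

block = ResidualFireModule(in_ch=64, squeeze_ch=16, expand_ch=64)
y = block(torch.randn(1, 64, 32, 32))  # spatial size is preserved
```

Note that no pooling occurs inside the block: the 3×3 path uses padding, so the 32×32 spatial resolution of the input is retained in the output, consistent with the resolution-preservation argument above.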

A. Ablation study
We performed an ablation study on the proposed method integrated with different convolutional units to investigate the influence of each choice on the segmentation performance: (i) a forward convolution unit, (ii) a fire module-based convolutional unit, (iii) a residual convolutional unit, and (iv) a fire module with a residual convolutional unit (the proposed method). All of these convolutional units were implemented with U-SegNet [12] as the base network. The first model consisted of forward convolutions with ReLU activations on both the encoder and decoder sides and represents the conventional U-SegNet [12] model. The second model was formed by replacing the forward convolution with a fire module, which includes squeeze and expand layers, as shown in Fig. 10b. The third model was formed by concatenating the forward convolution output with its input, forming residual connections, as shown in Fig. 10c. The combination of both fire modules and residual connections in the U-SegNet network constitutes the proposed method. Table IV reports the performance of these four convolution units in terms of the DSC score, the number of model parameters, and the time required for training and testing. The simple forward convolution model showed an overall DSC score of 92.35% with 4 million parameters, requiring 1.6 hours of training. The same model with the fire module showed 0.5% higher accuracy, with 5.5 times fewer learnable parameters and 50% less computation time. In contrast to forward convolution, the shortcut connections were constantly active, and gradients could readily backpropagate through them, which resulted in better accuracy. However, this residual-based model had 21 million parameters and consumed 7 hours to train and test, four times longer than the baseline U-SegNet [12].
Although better accuracy was obtained with residual connections alone, they are a non-ideal choice for image segmentation tasks because of the large number of parameters they generate and their high time complexity. To overcome this problem, we proposed using fire modules within the residual network. The fire module-based network demonstrates a significant reduction in the number of learnable parameters as well as in model training time while retaining network accuracy. The 1×1 convolutions in the squeeze layer of the fire module reduce dimensionality and aid faster model convergence, while the expand layer, combining 1×1 and 3×3 convolutions, helps to prevent the eventual gradient losses caused by the squeeze layer. The network combining the fire module and residual connections showed an overall DSC score of 96%, an improvement of 2.5% over the baseline U-SegNet [12], while requiring only 1.5 hours of computing time. Hence, the shortcut connections transport residual information through each network layer, which increases segmentation accuracy, while the fire modules reduce the model parameters and computation time, making the proposed method superior to the conventional methods.
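The parameter savings provided by the squeeze layer can be verified with a quick weight count. The channel sizes below are illustrative (not taken from the paper), and biases are ignored:

```python
# standard 3x3 convolution mapping c_in -> c_out channels
def conv3x3_params(c_in, c_out):
    return 9 * c_in * c_out

# fire module: squeeze (1x1), then expand (1x1 and 3x3 in parallel)
def fire_params(c_in, squeeze, e1x1, e3x3):
    return c_in * squeeze + squeeze * e1x1 + 9 * squeeze * e3x3

# example: 64 input channels, 128 output channels in both cases
plain = conv3x3_params(64, 128)                        # 73,728 weights
fire = fire_params(64, squeeze=16, e1x1=64, e3x3=64)   # 11,264 weights
print(plain / fire)  # roughly 6.5x fewer parameters
```

Because the expensive 3×3 filters operate on the squeezed (16-channel) tensor rather than the full 64-channel input, the cost of the dominant term drops proportionally, which is the same mechanism behind the 5.5-fold parameter reduction reported in the ablation study.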

V. CONCLUSION
Although state-of-the-art methods provide satisfactory results for brain MRI segmentation, they face difficulties when the targets exhibit only small variations. In this paper, we proposed a residual convolution-based squeeze U-SegNet architecture for brain MRI segmentation. The residual connections help propagate feature information better through the network layers. Hence, the proposed model, combining both residual connections and fire modules, exhibits better performance with fewer network parameters than the conventional methods. Furthermore, uniform input patches were able to capture fine local details. Our proposed network provided the best DSC value of 96%, with the lowest MSE of 0.004. The experimental results show that a network with patch-wise input, combined with residual convolutions and fire modules, produces an effective and efficient brain MRI segmentation model.