Exploring the U-Net++ Model for Automatic Brain Tumor Segmentation

The accessibility and potential of deep learning techniques have increased considerably in recent years. Image segmentation is one of the many fields in which novel implementations have been developed to solve domain-specific problems. U-Net is an example of a popular deep learning model designed specifically for biomedical image segmentation, initially proposed for cell segmentation. We propose a variation of the U-Net++ model, which is itself an adaptation of U-Net, and evaluate its brain tumor segmentation capabilities. The proposed approach obtained Dice Coefficient scores of 0.7192, 0.8712, and 0.7817 for the Enhancing Tumor, Whole Tumor, and Tumor Core classes of the BraTS 2019 challenge Validation Dataset. The proposed approach differs from the standard U-Net++ model in a number of ways, including the loss function, number of convolutional blocks, and method of employing deep supervision. Data augmentation and post-processing techniques were also implemented and observed to substantially improve the model predictions. Thus, this article presents a novel adaptation of the U-Net++ architecture which is lightweight and performs comparably with peer-reviewed work evaluated on the same data.


I. INTRODUCTION
Brain tumors may be defined as abnormal growths of cells within the brain [1]. The 2020 Statistics for Adolescents and Young Adults [2] estimate 3700 cases of brain cancer, which is the most common cause of death for men in this age group (10-39 years) and the second largest cause of death overall after female breast cancer. The 2020 GLOBOCAN Cancer Statistics [3] estimate close to 19.3 million cancer cases worldwide, with close to 10 million deaths. Brain and nervous system cancers accounted for over 300,000 new cases and 250,000 new deaths in 2020.
Magnetic Resonance Imaging (MRI) is a frequently used imaging method for diagnosing and monitoring brain tumors. The analysis of MR images may be categorized by the degree of user involvement. The work in [4] leverages this method of categorization and classifies techniques as being manual, semi-automatic, and fully-automatic brain tumor segmentation approaches.
The associate editor coordinating the review of this manuscript and approving it for publication was Fahmi Khalifa.

A. MOTIVATION
According to [4], the clinical use of segmentation techniques generally depends on the simplicity of the approach and the level of interaction a user has with the system. Experts' level of trust in automated systems is another contributing factor. Thus, some medical institutions may favor manual segmentation over techniques which may appear complex and require extensive training. Manual brain tumor segmentation is a tedious process which requires analysts to manually trace the region of interest (ROI) on MR image slices, using software tools with sophisticated graphical user interfaces [4].
Manual segmentation is time consuming and also susceptible to human error such as inter- and intra-operator variability, as shown in [5]. The latter work shows that maintaining a consistent manual segmentation strategy is difficult, even on the same MR image. Nonetheless, [6] claim that manual segmentation techniques are still carried out at a number of institutions. An automatic system for brain tumor segmentation could minimize the drawbacks of human error and be invariant to external factors such as distractions and the mental state of the practitioner.
Current research has produced some capable automatic systems, as discussed in Section III-B. Thus, an individual developing an automatic segmentation system in present times should not focus solely on producing a model which learns the segmentation task and performs it automatically. Effort should also be invested in providing improvements such as adjusting the model's architecture to consume fewer resources, making it more accessible to practitioners and researchers alike.
Datasets for researching brain tumor segmentation have also become more widespread owing to competitions such as the Medical Image Computing and Computer Assisted Intervention (MICCAI) Multimodal Brain Tumor Segmentation Challenge (or BraTS) [7]-[11]. An example of a BraTS data sample and the model prediction of the corresponding brain tumor are shown in Figure 1. Further explanation of BraTS and the BraTS datasets is provided in Section II-A.

B. AIM AND OBJECTIVES
The main aim of this paper is to create a model which takes multimodal 3D MR images as input to automatically generate a prediction of the corresponding brain tumor. The model output is also compatible with standard MR viewers. This goal was achieved by following the objectives below:
• Surveying state-of-the-art methods at the time to produce a unique approach with results adequate for a clinical setting.
• Devising a model which works automatically, not requiring any user feedback or input for training and prediction.
• Adapting the U-Net++ [12] model architecture and identifying performance changes when modifying its features.

II. BACKGROUND
This section presents an outline of the data and the deep learning model architectures discussed in this article. The U-Net model [13] and its many adaptations are a considerable inspiration for the model presented in this paper. One can also observe that the number of U-Net models submitted to the BraTS challenges increased significantly in recent years of the challenge [14]. Thus, the background for each of the models presented in this work is also provided, followed by an introduction to the metrics used to evaluate each of the models.

A. MICCAI BraTS
The MICCAI BraTS challenge is a competition hosted by the Center for Biomedical Image Computing and Analytics (CBICA) at the University of Pennsylvania. The BraTS challenges identify and showcase state-of-the-art techniques for brain tumor segmentation. The datasets distributed by the competition organizers consist of real world data in the form of multi-institutional routine MRI scans, manually segmented by multiple board-certified neuroradiologists [9]. The scans are split into high-grade gliomas (HGG) and low-grade gliomas (LGG) and provided in the T1w, T1ce, T2w, and FLAIR modalities. The individual sequence types make the dataset more robust owing to the different strengths of each MR image modality. T1-weighted (or T1w) sequences display fluid and water-based tissues as mid-grey whilst fatty tissue has a high intensity [15]. Contrast agents applied to T1w images produce T1ce images, which enhance the intensity of highly vascular tumors [15].
T2-weighted (T2w) images are visually opposite of their T1w counterparts, as fluids are now the brightest feature, and fat and water-based tissues are mid-grey [15]. Finally, FLAIR sequences are a variation of T2w images, where the cerebrospinal fluid (CSF) within the brain and any tissues with a similar T1 value are suppressed from the scan [15]. A sample of each sequence type taken from the training data used in this study is shown in Figure 2. The BraTS data's multimodal nature allows competitors to devise segmentation approaches which are robust to the MRI sequence type. Since the data are also obtained from multiple institutions, competition submissions are also viable in real-world scenarios. In this paper, the 2019 challenge datasets were leveraged, as explained in Section IV-A.

B. U-NET AND RESIDUAL U-NET
Image segmentation problems present an additional layer of difficulty compared to more standard image/object recognition problems such as scene classification. In the latter problem, a model would learn to take images of scenery as input and produce one class label for the entire image. Predicting a single class label for an entire scan would only be sufficient for tasks such as detecting whether an image contains a pathology, rather than identifying its location and extracting the tumor. In image segmentation, every pixel (or voxel for 3D images) is assigned a class. This requires more complex feature extraction to be performed by a model. Moreover, due to the spatial resolution of these images, care must be taken not to encumber a network with too many parameters. This is the main inspiration behind the U-Net Convolutional Neural Network (CNN) [13].
The U-Net model is split into two halves, forming its eponymous 'U-shape'. In the first half of the network (or 'encoding' path), an increasing amount of salient information from the input images is extracted at each level of the encoder. This is done by downsampling the input image and simultaneously doubling the number of feature maps. The second half of the network (or 'decoding' path) performs the opposite function, restoring the spatial size of the image whilst reducing the number of feature maps.
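The halving and doubling pattern described above can be sketched with a few lines of arithmetic. This is only an illustration: the function name and the default values (an input size of 128, 16 starting filters, and five levels) are assumptions chosen for the example, and the sketch tracks shapes only, not the convolutions themselves.

```python
def unet_level_shapes(input_size=128, base_filters=16, levels=5):
    """Return (spatial_size, n_filters) for each encoder level.

    Assumes every downsampling step halves each spatial dimension and
    doubles the number of feature maps (illustrative defaults only).
    """
    shapes = []
    size, filters = input_size, base_filters
    for _ in range(levels):
        shapes.append((size, filters))
        size //= 2
        filters *= 2
    return shapes


encoder = unet_level_shapes()
# The decoder mirrors the encoder levels above the deepest one, in reverse order.
decoder = encoder[-2::-1]

for size, filters in encoder:
    print(f"encoder level: {size}^3 voxels, {filters} feature maps")
```

The deepest entry of `encoder` corresponds to the bridge between both halves, which is why it is excluded from `decoder`.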
Skip-connections connect both halves of the network via concatenation layers, which combine the information extracted from the encoding path with the data in the decoder. U-Net exhibited a model capable of performing biomedical image segmentation whilst maintaining a low number of parameters. The model performed well enough to achieve first place in the International Symposium on Biomedical Imaging (ISBI) challenge for segmentation of neuronal structures in electron microscopic stacks, by a considerable margin.
An important adaptation of the U-Net architecture is the residual U-Net. A notable variation was proposed by [16], who developed a U-Net model which used element-wise additions to combine the input and output of the convolutional blocks at each level of the first half of the network. The model also used small kernels and zero padding in its convolutions, and replaced max pooling with strided convolutions. Deep supervision [17], [18] was also employed in the decoder half of the network, where secondary segmentation maps were generated at each level of the decoder and combined using element-wise additions. Isensee et al. [19] would adapt [16] using a smaller batch size, double the filter map resolution, and a multi-class weighted Dice loss function as submissions for BraTS 2017 and BraTS 2018.

C. U-Net++
Another U-Net adaptation was proposed in [12], which made use of dense blocks within the U-Net architecture. The standard encoder-decoder structure of U-Net was maintained; however, this was combined with additional upsampling layers along the skip-connections between the encoder and decoder halves of the network. This builds upon the convention of standard U-Net where a concatenation connects the encoder to the decoder at each level. The motivation behind this was to address the semantic gap between both halves of U-Net prior to concatenation [12]. The work by [20] combined U-Net++ and Half-Dense U-Net [21], which also shares properties of dense networks [22] and standard U-Net [13]. In [20], the combination of both networks was done specifically to target difficulties in combining low-level and top-level features in convolutional neural networks.
The U-Net++ architecture allows for the concatenations to become increasingly refined at higher levels of the decoder part of the model. The standard U-Net architecture presented in [13] only upsamples layers from the decoder, following concatenation via skip-connection. U-Net++ maintains these layers and also includes further upsampling operations at every level of the first half of the network. This creates structures similar to smaller U-Nets within the model. The end result is the combination of U-Net's architecture with more complex skip connections. In theory, this results in the combined benefit of fewer parameters from the U-Net model with the rich feature space of dense networks. Moreover, [12] also made use of deep supervision along the first skip pathway, which produces full resolution segmentation maps.
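The nested skip connections described above can be made concrete by listing, for every node in the network, which other nodes feed into it. The sketch below follows the general U-Net++ scheme from [12]; the node indexing (i for depth, j for position along the skip pathway), the function name, and the edge labels are illustrative assumptions rather than part of any published implementation.

```python
def unetpp_node_inputs(levels=5):
    """Map each node (i, j) to the nodes it reads from.

    i is the depth (0 = full resolution) and j is the position along the
    skip pathway (0 = encoder backbone). Names are hypothetical.
    """
    inputs = {}
    for i in range(levels):
        for j in range(levels - i):
            if j == 0:
                # Backbone node: the image itself, or the downsampled level above.
                inputs[(i, j)] = [("image", None)] if i == 0 else [("down", (i - 1, 0))]
            else:
                # Nested node: every earlier node on the same level (concatenated),
                # plus the upsampled output of the node one level deeper.
                same_level = [("skip", (i, k)) for k in range(j)]
                inputs[(i, j)] = same_level + [("up", (i + 1, j - 1))]
    return inputs


nodes = unetpp_node_inputs(levels=5)
for node in sorted(nodes):
    print(node, "<-", nodes[node])
```

In this indexing, the nodes (0, 1) onwards lie along the first skip pathway and operate at full resolution, which is where [12] attach deep supervision.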
Whilst the increased complexity of the model implies a correspondingly larger architecture, [12] claim that the number of parameters is quite similar to the original U-Net [13], and a wide variant of U-Net which uses larger feature channels. This comparison is also made on the grounds that the same number of convolutional kernels are used in both models. In [12], a comparison between U-Net++ and the standard, wide U-Net was computed using the Jaccard Index (also known as Intersection over Union or IoU). The resulting scores showed that U-Net++ outperformed standard and wide U-Net by an average of 2.8 to 3.3 points of IoU.
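For reference, the Jaccard Index between two binary masks can be computed as below. This is a minimal sketch using NumPy; the function name and the convention of returning 1.0 for two empty masks are assumptions made for the example.

```python
import numpy as np


def jaccard_index(pred, truth):
    """Intersection over Union between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(np.logical_and(pred, truth).sum() / union)


print(jaccard_index([1, 1, 0, 0], [1, 0, 1, 0]))
```

The Jaccard Index J and the Dice Coefficient D of the same pair of masks are related by D = 2J / (1 + J), so rankings under the two metrics agree even though the numerical values differ.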

D. EVALUATION CRITERIA
The criteria used to assess the model's performance closely follow the metrics used by the BraTS challenges. Namely, the predictions are evaluated on the basis of their Dice Coefficient, Sensitivity, Specificity, and Hausdorff Distance (95th percentile) across all three of the Whole Tumor (WT), Tumor Core (TC), and Enhancing Tumor (ET) target classes. The Dice Coefficient and Hausdorff Distance metrics measure the model's segmentation performance in terms of how closely the predicted tumor classes reflect the ground truth images. The 95th percentile of the Hausdorff Distance was likely chosen to avoid skewing the scores in case an extreme outlier exists in a model's predictions. The sensitivity and specificity measurements quantify the capability of the model to minimize false negatives and false positives, respectively.
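The four metrics can be sketched for a single binary target class as follows. This is an illustrative NumPy implementation under stated assumptions, not the evaluation code used by the BraTS portal: the function names are hypothetical, empty masks are scored as 1.0 by convention, and the Hausdorff sketch uses brute-force distances between all foreground voxels in voxel units, whereas production implementations typically work on surface voxels and account for voxel spacing.

```python
import numpy as np


def confusion_counts(pred, truth):
    """Return (TP, FP, FN, TN) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    tn = int(np.sum(~pred & ~truth))
    return tp, fp, fn, tn


def dice(pred, truth):
    tp, fp, fn, _ = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # assumption for empty masks


def sensitivity(pred, truth):
    tp, _, fn, _ = confusion_counts(pred, truth)
    return tp / (tp + fn) if (tp + fn) else 1.0


def specificity(pred, truth):
    _, fp, _, tn = confusion_counts(pred, truth)
    return tn / (tn + fp) if (tn + fp) else 1.0


def hausdorff95(pred, truth):
    """95th-percentile symmetric Hausdorff distance for two non-empty masks."""
    p = np.argwhere(np.asarray(pred, dtype=bool))
    t = np.argwhere(np.asarray(truth, dtype=bool))
    # Pairwise Euclidean distances between every predicted and true voxel.
    d = np.linalg.norm(p[:, None, :] - t[None, :, :], axis=-1)
    pred_to_truth = d.min(axis=1)
    truth_to_pred = d.min(axis=0)
    return float(max(np.percentile(pred_to_truth, 95),
                     np.percentile(truth_to_pred, 95)))
```

The brute-force distance matrix is only practical for small masks; it is used here to keep the definition visible.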
There are various reasons why these metrics were kept as the final evaluation criteria for this paper. Firstly, the validation data are provided without ground truths, consisting only of the 125 multimodal patient volumes for BraTS 2019. Thus, evaluation is only possible on the CBICA BraTS web portal where a system impartially evaluates submissions against ground truths stored on the site. The portal then generates evaluation results which make use of the aforementioned criteria. Secondly, this process provides a common framework for model evaluation which allows for accurate comparison with other research, both for competition submissions and any alternative peer-reviewed works.
VOLUME 9, 2021

III. RELEVANT LITERATURE
This section will present a summarised timeline of brain tumor segmentation techniques, ranging from classical machine learning techniques such as clustering and support vector machines, to more modern approaches such as deep neural networks and U-Net adaptations. Particular emphasis was placed on techniques evaluated on different years of the BraTS challenges, especially for the deep learning approaches. The main reason for this is that it provides insight into how different methods performed on the BraTS data as a somewhat collective framework. Moreover, the data was becoming increasingly refined with every iteration of the challenge.

A. CLASSICAL MACHINE LEARNING TECHNIQUES
Initial research on brain tumor segmentation mainly consisted of several supervised and unsupervised machine learning (ML) approaches, as stated by [23]. Since datasets at the time were scarcer, data acquisition was more scattered, making it difficult to assess work since most studies would be using different datasets without a common evaluation technique. Nonetheless, unsupervised techniques proved useful in this time when unlabelled data was common as they did not rely on having high quality ground truth annotations accompanying an MR image dataset.
Clustering is one such technique which was frequently used for brain tumor detection and segmentation. The work by [24] explored the use of K-means clustering for tumor detection in MRI. The approach involved several stages, namely converting grayscale MR images to RGB, and then to CIELAB format, which makes use of chromaticity and luminosity coefficients. This approach was also used by [25] in their study using an 'intuitionistic' version of Fuzzy C-Means (FCM) clustering. [26] later compared the performance of K-means, Fuzzy K-means, Gaussian Mixture Model (GMM), and Markov Random Field (MRF) on the GBM samples from the BraTS 2013 Test dataset. The best results for this study were obtained by the MRF approach, with Dice Coefficient scores of 0.72, 0.62, and 0.59 for the WT, TC, and ET.
Supervised ML techniques such as Support Vector Machines (SVM) were also popularly used for brain tumor detection and segmentation. An example of SVM applied to this domain is the work by [27], who made use of a one-class SVM with an initial user seed point for the tumor used as input to the SVM, obtaining a percentage accuracy of 83.5% on 24 slices across 5 patients. A more recent approach by [28]
Random Forests are also a popular classifier for MR image segmentation. The implementation by [29] made use of a GMM combined with a 2-stage Random Forest, obtaining Dice scores of 0.87, 0.78, and 0.74 for the WT, TC, and ET for BraTS 2013. [30] later combined Random Forests with texture features for supervoxel classification, with positive results for BraTS 2013.

B. DEEP LEARNING TECHNIQUES
The popularity of classical machine learning methods and unsupervised approaches has waned in recent years, with the current trend shifting towards robust deep networks [23]. Submissions to the most recent iterations of the BraTS challenge are mostly CNN variations, as showcased in the survey of BraTS competition submissions in [14], in contrast to the clustering and classical machine learning approaches seen in the initial years of the competition.
A notable CNN implementation for brain tumor segmentation was proposed by [31], who made use of separate CNN pathways, one for HGG cases and the other for LGG, with different architectures and normalization configurations for each path. This research is also notable for its use of small convolutional kernels, inspired by [32]'s research on VGGNets. For image pre-processing, [33]'s bias field correction was implemented alongside an algorithm developed by [34] to standardise values across all sequences. [31]
A 3D CNN for brain tumor segmentation named 'DeepMedic' was proposed by [35] for the 2015 and 2016 iterations of the BraTS challenge. The images were normalized by subtracting their mean and dividing by standard deviation. The CNN used was 11-layers deep, using two parallel-processing pathways at different resolutions. Small kernels were also used as in [31]. [35] also made use of residual connections in a new model extending DeepMedic, named 'DMRes'. The performance of both models was evaluated on BraTS 2015 and 2016, and for 2015 DeepMedic obtained a Dice coefficient of 0.89, 0.75 and 0.72 for the WT, TC, and ET classes. DMRes performed better for the Dice and sensitivity metrics, but saw a slight decrease in precision. DMRes also achieved the top Dice scores for the TC and ET classes of images for the 2016 challenge, when combined with a Conditional Random Field approach.
[36] proposed an approach using deep neural networks for brain tumor segmentation. The architecture consisted of two pathways, making use of 7 × 7 and 13 × 13 feature map resolutions respectively. Bias-field correction and normalization were applied to the data for pre-processing. [36] also removed the top and bottom 1% of intensities from the input images. Training was also split into multiple phases to counter the healthy-to-diseased voxel imbalance, using a patch dataset with equiprobable labels. The project was evaluated on the BraTS 2013 test dataset, with competitive WT, TC, and ET Dice scores of 0.88, 0.79, and 0.73.
As discussed previously in Section II-B, two notable approaches using the residual U-Net architecture are the works by [16] and [19]. Both approaches were evaluated on separate BraTS datasets, obtaining very competitive results. Isensee et al. [19] also returned in 2018 with their 'No New-Net' [37] implementation. The latter work featured a very similar model to the 2017 submission, using a more refined pre-processing method and additional input data for training the model. No New-Net finished in second place for BraTS 2018.
Whilst the previous models made use of residual connections, [38] later made use of dense blocks [22] in a U-Net style network with encoding and decoding pathways. The work by [38] was an adaptation of the team's previous semantic segmentation approach named 'DeepSCAN' [39]. The large parameter requirements of the dense DeepSCAN network were the motivation for [38] to integrate U-Net with the system, allowing for a lower spatial resolution within the dense portion of the network to keep the model size reasonable. This approach performed competitively in BraTS 2018, placing directly below No New-Net [37] in third place.
The model which secured first place in the BraTS 2018 challenge was proposed by [40]. The approach featured an encoder-decoder CNN with a variational auto-encoder (VAE) branch. This model works in a similar way to U-Net, with the main difference in this model being how the output of the encoder was split halfway into the mean and standard deviation, which were then used to generate samples from a Gaussian distribution to reconstruct the images prior to the beginning of the localisation process. The approach also used a very large patch size of 160 × 192 × 128, which retained a large amount of the original images' information. The Dice Coefficient scores obtained on the BraTS 2018 testing dataset for the WT, TC, and ET classes were 0.88, 0.82, and 0.77.

IV. METHODOLOGY
Prior to addressing each of the individual processes in the system pipeline, one may identify the entire workflow at a high level. The pipeline implemented and presented in this paper was adapted from a popular brain tumor segmentation online repository (https://github.com/ellisdg/3DUnetCNN), aiming to replicate the implementation by [19]. An adequate understanding of the BraTS ground truth labels, target classes, and data distribution in terms of the HGG-to-LGG split is an essential complement to understanding these steps. Thus, a data definition section is provided prior to the breakdown of each step of the pipeline.
The first step in the pipeline involved pre-processing the input data using bias field correction, cropping, and normalization. With the data pre-processed and ready for training, the next step was to apply one-hot encoding to the ground truths. Once the model had been trained on the input MR image volumes and ground truths, the model weights were preserved and used for generating predictions from the validation data, which had been passed through the pre-processing pipeline independently. The final step involved resampling and interpolating the predictions to their initial dimensions, before uploading them to the BraTS web portal for the final evaluation. Each of these steps is explained further in the corresponding sections to follow.

A. DATA DEFINITION
The data for this paper were acquired from the 2019 MICCAI BraTS challenge. At the time of development, the 2019 data was the most robust version from all the challenge datasets, also including the largest amount of multi-institutional post-operative MRI scans. Moreover, an additional validation dataset was included with the training and testing datasets starting from BraTS 2017. It is of note that the BraTS 2019 testing dataset was unfortunately restricted to a 48-hour window during the live 2019 challenge, and not available for academic/research purposes. Nonetheless, since the validation data is an entirely separate set of data from the images used for model training, it is valid for evaluation purposes.
The BraTS ground truth annotations are composed of three main categories, split into labels 1, 2, and 4 in the ground truths. The BraTS target classes ET, WT, and TC are composed of different combinations of these labels, as shown in Table 1. A visual representation of the labels is also shown in Figure 3.
The training data consist of 259 HGG and 76 LGG cases whilst the validation data consist of 125 cases which are not explicitly labelled as HGG or LGG. Examples of the HGG and LGG cases in the training data are shown in Figure 4. The image also shows the inter- and intra-categorical differences for both LGG and HGG, exhibiting how even the same class of pathology can have varying shapes and textures.
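As an illustration of how the three labels combine into the target classes, the sketch below builds one binary mask per class from a label map. The mapping used here is the commonly documented BraTS convention (ET is label 4, TC is labels 1 and 4, WT is labels 1, 2, and 4) and is stated as an assumption, since Table 1 is not reproduced in this text; the names `TARGET_CLASSES` and `target_masks` are hypothetical.

```python
import numpy as np

# Assumed label combinations per target class (standard BraTS convention).
TARGET_CLASSES = {
    "ET": (4,),       # enhancing tumor
    "TC": (1, 4),     # tumor core
    "WT": (1, 2, 4),  # whole tumor
}


def target_masks(label_map):
    """Return a dict of boolean masks, one per BraTS target class."""
    label_map = np.asarray(label_map)
    return {name: np.isin(label_map, labels)
            for name, labels in TARGET_CLASSES.items()}


example = np.array([0, 1, 2, 4])
masks = target_masks(example)
for name, mask in masks.items():
    print(name, mask.astype(int))
```

Under this convention the classes are nested, so every ET voxel also belongs to TC, and every TC voxel also belongs to WT.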
Both sets of the BraTS data are multimodal, consisting of the aforementioned T1, T1ce, T2, and FLAIR sequence types. The data are available in a compressed NIfTI (*.nii.gz) file format and categorized by case ID. Some samples were maintained from previous years, with all images manually segmented by multiple expert board-certified neuroradiologists [9].

B. PRE-PROCESSING
The input volumes were first passed through N4 bias field correction [33], using the Advanced Normalization Tools (ANTs) library [41]. The FLAIR volumes were excluded from the bias field correction process and included in the next pre-processing step with the corrected images. Background removal was then applied to each sample, removing all values between 0 and a relative tolerance parameter (in this case the default value of 1e-8). It should be noted that since each scan had a slightly different distribution of non-zero values, the cropping operation produced new images with different resolutions, which would not be viable for model training.
The images were thus resampled and interpolated to 128 × 128 × 128, as in [19]. The image resizing steps were also applied to the corresponding ground truths, with the exception that nearest neighbour interpolation was used for the ground truths to avoid including values outside of the predefined BraTS labels. Finally, z-score normalization was used, transforming the input images to have zero mean and unit variance, using the formula in Equation (1):
$$x_{new} = \frac{x - \mu}{\sigma} \tag{1}$$
where x and x_new refer to the original and normalized samples, with µ and σ referring to the mean and standard deviation of the corresponding entire dataset.
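The cropping, nearest neighbour resizing, and z-score steps can be sketched with NumPy as follows. This is a simplified illustration rather than the pipeline's actual code: the function names are hypothetical, the tolerance is assumed to be applied relative to the largest absolute intensity in the volume, and bias field correction and the interpolation used for the MR images themselves are omitted.

```python
import numpy as np


def crop_to_foreground(volume, rtol=1e-8):
    """Crop a volume to the bounding box of its non-background voxels."""
    threshold = rtol * np.abs(volume).max()  # assumed relative tolerance
    coords = np.argwhere(np.abs(volume) > threshold)
    if coords.size == 0:
        return volume  # nothing above the threshold; leave unchanged
    lo = coords.min(axis=0)
    hi = coords.max(axis=0) + 1
    return volume[tuple(slice(a, b) for a, b in zip(lo, hi))]


def resize_nearest(volume, shape):
    """Nearest neighbour resize; only copies existing values, so label
    maps never gain values outside the original label set."""
    index = [(np.arange(n) * s) // n for n, s in zip(shape, volume.shape)]
    return volume[np.ix_(*index)]


def zscore(volume, mean, std):
    """Z-score normalization with a precomputed mean and standard deviation."""
    return (volume - mean) / std
```

Because nearest neighbour resizing selects existing voxels instead of averaging neighbours, a ground truth containing only the values 0, 1, 2, and 4 keeps exactly that value set after resizing, which is the reason given above for treating the ground truths differently.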

C. TRAINING
Following normalization, cropping, and resampling of the images, the next step was training the model to automatically extract the multiclass tumor segments. Since the pipeline follows the process used in [19], the hyperparameters used during training were maintained. Samples were processed one-by-one rather than in batches due to the data's dimensionality. The ground truths were also passed through one-hot encoding, transforming the original images with labels {1, 2, 4} into multiple binary segmentation maps, i.e. one map with values {0, 1} for each of the labels 1, 2, and 4. The training dataset was split into an 80-20 train-test split, resulting in 268 total training steps. Each of the internal models (discussed in Section IV-D) was trained using these parameters, with the training period spanning 300 epochs and using a learning rate of 5e-4. The optimizer used for the model during training was the Adam optimizer [42]. To handle the class imbalances present in the data, the multi-class adaptation of the Dice loss devised by [19] was used, as presented in Equation (2).
Here, K refers to the 3 ground truth labels and Y, Ŷ refer to the images of the ground truth and model prediction respectively. The divisor coefficient and summation outside of the main function modify the standard Dice loss to handle multiclass evaluation, and α refers to a smoothing constant with a value of 1e-5.
One should note that training was largely carried out on Google Cloud and split between two server instances. The initial machine made use of a Tesla K80 GPU with 12 GB of GPU memory. A switch was made shortly after to an instance with a Tesla P100 GPU with 16 GB of GPU memory. Apart from the increased GPU memory, the compute capability of the Tesla P100 was much higher, allowing for training to complete much faster. Training the final model for 300 epochs took between 2 and 3 days when using the Tesla P100 GPU, compared to 6 to 7 days when using the Tesla K80. More detailed parameters related to training are shown in Table 2. The tabulated data include information such as the amount and type of GPU memory, CUDA cores, and exact training times.
Data augmentation was also applied during training to produce synthetic samples of the BraTS training images. As stated by [13], the objective of using data augmentation for datasets with limited data is to produce a more robust dataset for the model during training. For this experiment, random permutations of rotations, axis flips, and transpositions were applied to the training batches. Rotations were applied to the images in multiples of 90 degrees, and axis flips were performed on all three of the x, y, and z axes. Transposition in this case refers to the image data matrix being transposed, changing the order of the dimensions.
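The three augmentation operations can be sketched with NumPy as shown below. This is an illustrative sketch, not the code used for the experiments: the function names, the 50% flip probability, and the way the rotation plane is sampled are all assumptions. The key point it demonstrates is that one randomly drawn set of operations must be applied identically to the image and to its ground truth.

```python
import numpy as np


def random_ops(rng):
    """Draw one random combination of rotation, flips, and transposition."""
    k = int(rng.integers(0, 4))  # number of 90-degree rotations
    plane = tuple(int(a) for a in rng.choice(3, size=2, replace=False))
    flips = [ax for ax in range(3) if rng.random() < 0.5]  # assumed probability
    perm = tuple(int(p) for p in rng.permutation(3))  # new order of x, y, z
    return k, plane, flips, perm


def apply_ops(volume, ops):
    """Apply the operations to the last three (spatial) axes of an array."""
    k, plane, flips, perm = ops
    first = volume.ndim - 3  # index of the first spatial axis
    out = np.rot90(volume, k=k, axes=(plane[0] + first, plane[1] + first))
    for ax in flips:
        out = np.flip(out, axis=ax + first)
    leading = tuple(range(first))  # e.g. the modality axis stays in place
    return np.transpose(out, leading + tuple(p + first for p in perm))


rng = np.random.default_rng(0)
image = rng.normal(size=(4, 8, 8, 8))       # 4 modalities, toy 8^3 volume
label = rng.integers(0, 3, size=(8, 8, 8))  # toy label map
ops = random_ops(rng)
aug_image, aug_label = apply_ops(image, ops), apply_ops(label, ops)
```

Since the pre-processed volumes are cubic (128 × 128 × 128), rotations and transpositions leave the array shape unchanged, so augmented samples can be fed to the network without further resizing.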

D. MODELS
Three U-Net based models were built internally following the training process described in Section IV-C. These models were built to compare U-Net, Residual U-Net, and the proposed model within the same data processing pipeline and training conditions. This comparison between the models is described further in Section V-C.
The first of these models takes inspiration from the original U-Net model [13], shown in Figure 5. The encoder part of the standard U-Net model features convolutional blocks composed of two 3 × 3 × 3 convolutions with a standard ReLU nonlinearity function followed by a 2 × 2 × 2 max pooling operation. The small convolutional kernels allow the model to maintain a relatively small number of parameters [32]. There are five levels of depth in the network, with the final level being a bridge to the decoder part of the network. Concatenation layers connect both halves of the model at each level apart from the deepest block. Following initial experiments showing that dropout layers with the tested value were not beneficial to the approach, they were omitted from the standard U-Net model.
The second internal model follows the residual U-Net architecture devised by [19], shown in Figure 6. This model was built as per the aforementioned GitHub repository (https://github.com/ellisdg/3DUnetCNN), to be consistent with the work in [19]. Some differences between this model and the standard U-Net are the addition of residual blocks and the use of strided convolutions in place of max pooling along the encoder. Although there is no empirical evidence proving that strided convolutions are always superior to max pooling, they introduce the possibility for the model to 'learn' how to downsample the images better.
The residual U-Net model also uses upsampling layers in place of transposed convolutions in the decoder, as [19] claim that the latter may produce checkerboard artifacts in the output. The model also makes use of deep supervision, with secondary segmentation maps being generated along the decoder half of the network and combined using element-wise additions. The objective of this approach is to refine the final segmentation predictions generated by the model.
The final, proposed model is an adaptation of U-Net++ [12], shown in Figure 7. One can observe how the main difference between this model and the standard U-Net architecture is the more complex system of skip connections. Upsampling layers are now also present in the encoder part of the network in U-Net++, propagating information from deeper parts of the encoder up to the topmost layers. Deep supervision is also present here; however, this time it is placed along the first skip connection. The benefit of this approach with U-Net++ is that the blocks along the first concatenation produce full-resolution segmentation maps, consisting of upsampled feature data from the deeper layers of the encoder.
Since the U-Net++ model is a convolutional neural network, the model parameters learned during training are contributed by the 3D convolutional layers, instance normalization, and transposed convolutions. The number of parameters per convolutional layer (standard and transposed) is calculated using the formula in Equation (3), where x, y, and z refer to the convolutional kernel dimensions (3 × 3 × 3), d refers to the number of filters in the previous layer, and k refers to the number of filters in the current layer. Figure 8 shows how Equation (3) applies to the proposed model. Combining this information with the model structure shown in Figure 7, the total number of model parameters increases proportionally with the number of filters. This is primarily reflected in deeper layers in the model, and concatenation layers, both of which have outputs with larger filter sizes. Taking all of the above into consideration, the proposed model's total number of parameters is 4,516,700.
Whilst this proposed model is heavily inspired by the U-Net++ in [12], there are a number of key differences in the approach presented in this article. One of the principal differences is the convolutional block schema used by the proposed model. The original U-Net++ by [12] uses a horizontal block scheme which resembles the standard U-Net model, with two sets of convolution, batch normalization, and ReLU activations. Following the experiment described in Section V-A4, it was discovered that halving the number of convolutional blocks produced comparable results. The main benefit from this experiment was that the number of U-Net++ parameters using our setup dropped from 7.7M to 4.5M. Furthermore, the entire original U-Net++ model architecture presented in [12] totalled 9.04M model parameters. The drop in parameters is substantial, as smaller models with fewer parameters are less likely to overfit to the input data during training.
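The per-layer parameter count discussed above can be illustrated with the standard formula for a convolutional layer, (x · y · z · d + 1) · k, where the added 1 accounts for one bias term per output filter. Whether Equation (3) includes the bias term is an assumption here, so the sketch exposes it as an option; the function name is hypothetical.

```python
def conv3d_params(kernel, d, k, bias=True):
    """Trainable parameters of one 3D convolutional layer.

    kernel: (x, y, z) kernel dimensions
    d: number of filters (channels) in the previous layer
    k: number of filters in the current layer
    bias: assumed to be present, contributing one parameter per filter
    """
    x, y, z = kernel
    return (x * y * z * d + (1 if bias else 0)) * k


# First layer: 4 input modalities feeding 16 filters with 3 x 3 x 3 kernels.
print(conv3d_params((3, 3, 3), d=4, k=16))
# A deeper layer: 16 filters feeding 32 filters.
print(conv3d_params((3, 3, 3), d=16, k=32))
```

Because the count scales with the product d · k, doubling the filters at both ends of a layer roughly quadruples its parameters, which is why the deeper levels and the concatenation outputs dominate the total.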
Other differences include the loss function, explored in Section V-A3. Our model uses the weighted multi-class Dice Coefficient loss implemented by [19] in their Residual U-Net implementation, rather than the composite binary cross-entropy and Dice function used by [12]. The means of implementing 'deep supervision' to refine the secondary segmentation maps also differs from the averaging or fast-selection approaches of [12]. In the U-Net++ model proposed in this article, the secondary segmentation maps are combined using element-wise additions, as shown in Figure 7. Initial training runs showed that the model converged considerably better with the element-wise additions for the segmentation maps than without them.
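The element-wise combination of secondary segmentation maps can be sketched as follows. This is a simplified NumPy illustration rather than the implementation used in this article: it assumes that all maps already share the same full-resolution shape (classes, depth, height, width), that four output channels are used, and that a softmax is applied after the addition. The function names are hypothetical.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def combine_segmentation_maps(maps):
    """Add same-shaped secondary segmentation maps element-wise,
    then convert the summed scores into class probabilities."""
    combined = np.zeros_like(maps[0])
    for secondary_map in maps:
        combined = combined + secondary_map
    return softmax(combined, axis=0)


# Three hypothetical secondary maps with 4 channels on an 8 x 8 x 8 patch.
rng = np.random.default_rng(0)
maps = [rng.normal(size=(4, 8, 8, 8)) for _ in range(3)]
probs = combine_segmentation_maps(maps)
print(probs.shape)
```

Unlike averaging, the addition lets every secondary map contribute its raw scores to the final prediction, so gradients reach each supervised block directly during training.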
Other differences in our approach include the use of instance normalization, as our model only processes 'batches' of individual patients, hence batch normalization would destabilize training. The model also does not make use of dropout layers, and uses a starting filter count of 16 rather than 32 as in [12]. Moreover, the convolutional kernels used for segmentation have a size of 3 × 3 × 3 rather than 1 × 1 × 1. Whilst this provided only minor improvements in initial training runs, it was maintained for subsequent training of the model.
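To make the normalization choice concrete, the sketch below shows instance normalization in NumPy under simplifying assumptions: the learnable scale and shift parameters are omitted, and the input layout (batch, channels, depth, height, width) is assumed for illustration. Statistics are computed per sample and per channel, so the result does not depend on the batch size.

```python
import numpy as np


def instance_norm(x, eps=1e-5):
    """Normalise each (sample, channel) volume to zero mean and unit variance.

    x has shape (batch, channels, depth, height, width). Learnable scale and
    shift parameters are left out of this sketch.
    """
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


# A 'batch' holding a single patient with two feature channels.
rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=3.0, size=(1, 2, 4, 4, 4))
normalised = instance_norm(features)
print(normalised.mean(axis=(2, 3, 4)), normalised.std(axis=(2, 3, 4)))
```

With a batch of one patient, batch statistics would be estimated from a single sample at every step, whereas the per-instance statistics above are well defined regardless of how many patients are processed together.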

V. EVALUATION
This section presents the proposed model's performance. A number of experiments were conducted to extend the model and training parameters, and to identify any possible improvements to the final results. A detailed description of each experiment is provided in the sections to follow, including a summary of all experiments conducted in this research effort. Once the best model configuration was selected, the final results on the BraTS 2019 validation data were obtained. The results were compared internally with a standard U-Net inspired by the work in [13] and a residual U-Net model [19] architecture. An external evaluation was also conducted against a number of peer-reviewed approaches on the BraTS 2019 validation data.

A. EXPERIMENTS

1) ABLATION STUDY - DATA AUGMENTATION
Data augmentation techniques are used to generate synthetic samples of real-world data to create more input samples for model training. This is generally helpful for training models tasked with solving problems with scarce data, such as biomedical image segmentation. The original U-Net [13] proposal also made use of data augmentation techniques in this regard. To assess whether or not data augmentation was beneficial to the final model predictions, an ablation study was conducted, comparing two separate training runs. The results are presented in Tables 3 and 4. Comparing the tabulated scores, one can observe how data augmentation led to a substantial improvement across all categories of the evaluation criteria. The improved Dice Coefficient and Hausdorff Distance scores show that the segmentation performance of the model improved greatly when implementing data augmentation. This may be attributed to the fact that the new synthetic samples generated during training allowed the model to generalise better, improving tumour segmentation on the unseen validation data. The increase in sensitivity shows that the model also performed better in terms of avoiding false negatives. The marginal increase in average sensitivity score may be attributed to the fact that the initial score obtained by the model was already very high.
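As a simple illustration of how synthetic samples can be generated on the fly, the sketch below applies random axis flips to a multi-channel MR volume and its label map. Random flipping is used here only as an assumed example of an augmentation; the function name, the flip probability, and the array layout are hypothetical and are not taken from the pipeline evaluated above. The key point is that the same spatial transform must be applied to the image and to its ground truth.

```python
import numpy as np


def random_flip(image, label, rng, p=0.5):
    """Flip image and label together along each spatial axis with probability p.

    image -- array of shape (channels, depth, height, width)
    label -- array of shape (depth, height, width)
    """
    for axis in range(3):
        if rng.random() < p:
            image = np.flip(image, axis=axis + 1)  # skip the channel axis
            label = np.flip(label, axis=axis)
    return image, label


rng = np.random.default_rng(42)
image = rng.normal(size=(4, 8, 8, 8))        # e.g. four MR sequences
label = rng.integers(0, 3, size=(8, 8, 8))   # toy label map
aug_image, aug_label = random_flip(image, label, rng)
print(aug_image.shape, aug_label.shape)
```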

2) USING UPSAMPLED FEATURES DIRECTLY IN SKIP-CONNECTIONS
The next set of evaluated models were more compact versions of the proposed U-Net++ model shown in Section IV-D.
The networks were work-in-progress models being tested on local hardware. Thus, some minor modifications to the architecture were made to fit the networks on 6GB of GPU memory. These models follow the proposed U-Net++ architecture closely with two minor differences: a) features upsampled from the encoder were concatenated directly along the skip-connection rather than being passed through a convolutional block and additional concatenation; b) only the penultimate secondary segmentation map was used in the element-wise additions to refine the final segmentation result via deep supervision.
Two variations of this model were created. One was trained for 100 epochs as a part of research submitted to the Organization for Human Brain Mapping (OHBM) 2020 Annual Meeting [43]. The other model was trained for 300 epochs and submitted to the IEEE Mediterranean Electrotechnical Conference (MELECON 2020) [44]. In spite of being simpler variations of the proposed model and being evaluated on a holdout set of the BraTS 2019 training data, both models were accepted by the respective bodies. In this experiment, we compared the results of these models against the final, proposed U-Net++ on the BraTS 2019 validation data, shown in Tables 5 and 6. The proposed model outperformed both of the other approaches in the majority of the criteria, particularly for the Dice Coefficient. This is the expected outcome seeing as the proposed model is the 'full' version of U-Net++, leveraging the entire arsenal of dense connections and all secondary segmentation maps for deep supervision. This is also the reason why both of the conference models in this comparison have slightly fewer parameters. In addition, whilst the OHBM model obtained a slightly higher sensitivity score for the enhancing tumour class, its scores for the whole tumour and tumour core were considerably lower than those of the proposed model.

3) OPTIMIZATION FUNCTION
The next experiment evaluates the function used to optimize the model's training. In this paper, the employed loss function follows the multiclass Dice Coefficient loss proposed by [19] and shown in Equation (2). Nonetheless, since the proposed model is not a residual U-Net as in [19], we decided to also attempt training the model using a function which follows the composite binary cross-entropy and Dice loss used by [12] in the original U-Net++ paper, shown in Equation (4), where BCE and DSC refer to the binary cross-entropy and standard Dice Coefficient functions, Y and Ŷ refer to the BraTS ground truth and model prediction, K refers to the set of target classes, and α is a smoothing constant with a value of 1e-5. The comparison between both optimization functions is shown in Tables 7 and 8. From the results, one may notice that the Dice optimization function was superior in terms of raw segmentation, i.e. the Dice Coefficient and Hausdorff Distance. The intuition behind this result is that the use of the multiclass Dice Coefficient function in the proposed model allowed for a better overall classification of the tumour segments. Conversely, the binary cross-entropy loss performed better in terms of sensitivity and specificity. We followed the same route as many other works (such as [45], [46]), which prioritise the Dice Coefficient when evaluating models using BraTS data. As a result, the weighted multi-class Dice Coefficient function was kept for the proposed model.
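The two families of loss function compared in this experiment can be sketched in NumPy as follows. This is an illustrative approximation only: the exact forms of Equations (2) and (4) are not reproduced, the class weights are left to the caller, and the equal weighting of the cross-entropy and Dice terms in `bce_dice_loss` is an assumption. The smoothing constant of 1e-5 follows the value stated above; all function names are hypothetical.

```python
import numpy as np


def dice_per_class(y_true, y_pred, smooth=1e-5):
    """Soft Dice score for each class.

    y_true -- one-hot ground truth, shape (classes, depth, height, width)
    y_pred -- predicted probabilities, same shape
    """
    axes = tuple(range(1, y_true.ndim))
    intersection = (y_true * y_pred).sum(axis=axes)
    denominator = y_true.sum(axis=axes) + y_pred.sum(axis=axes)
    return (2.0 * intersection + smooth) / (denominator + smooth)


def multiclass_dice_loss(y_true, y_pred, weights=None):
    """One minus the (optionally weighted) mean of the per-class Dice scores."""
    dice = dice_per_class(y_true, y_pred)
    if weights is None:
        weights = np.ones_like(dice)
    weights = np.asarray(weights, dtype=float)
    return 1.0 - (weights * dice).sum() / weights.sum()


def bce_dice_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy plus Dice loss (equal weighting is an assumption)."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    bce = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)).mean()
    return bce + (1.0 - dice_per_class(y_true, y_pred).mean())


# Toy two-class example on a 2 x 2 x 2 volume.
target = np.zeros((2, 2, 2, 2))
target[0, :1] = 1.0
target[1, 1:] = 1.0
prediction = np.full_like(target, 0.5)
print(multiclass_dice_loss(target, prediction), bce_dice_loss(target, prediction))
```

The Dice term is computed from region overlap and is therefore insensitive to the large number of background voxels, whereas the cross-entropy term penalises every voxel individually, which is one plausible reason for the different behaviour of the two losses on sensitivity and specificity.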

4) USING ORIGINAL U-Net++ CONVOLUTIONAL BLOCKS
Research such as [36] claims that in some instances, adding convolutional blocks or increasing the filter map resolution did not result in any substantial performance increase in their CNN models. When testing different iterations of the model, one of the main considerations taken into account was the size of the model, in this case the number of model parameters. This was also highlighted by the very long training times for each of the internal models, as shown in Table 2. Taking all of the above factors into consideration, it was decided to test the model using only half of the convolution-normalization-activation blocks used in the original work by [12]. In essence, the goal was to check whether the tradeoff between model parameters and performance would be worth pursuing. The results are shown in Tables 9 and 10. From the results obtained, we can see that the two models obtain near equivalent results, barring the Hausdorff Distance measurement. At the same time, the proposed model, with fewer parameters, obtained a slightly improved average Dice Coefficient. Together, these results imply that the proposed model had a larger segmentation error for the 'worst' occurrence, yet still performed slightly better than the larger model on average, as shown by the Dice Coefficient. In our opinion, the reduction in model parameters from 7.7M to 4.5M (the larger block schema requires roughly 69% more parameters) is more significant than the minor deterioration in Hausdorff Distance and average sensitivity score. Thus, the new block schema with fewer parameters was maintained.
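The relationship between the two parameter counts quoted in this paper (7.7M and 4.5M) can be checked with simple arithmetic. The short sketch below uses the rounded figures, so the percentages differ slightly from values computed on the exact counts; the helper name `relative_change` is hypothetical.

```python
def relative_change(old, new):
    """Signed relative change from old to new, as a fraction of old."""
    return (new - old) / old


small, large = 4.5e6, 7.7e6  # rounded parameter counts quoted in the text

# Fraction of parameters removed when moving from the larger to the smaller model.
reduction = -relative_change(large, small)
# Fraction of parameters added when moving from the smaller to the larger model.
increase = relative_change(small, large)

print(f"reduction: {reduction:.1%}, increase: {increase:.1%}")
# reduction: 41.6%, increase: 71.1%
```

The same difference therefore corresponds to roughly a 40% reduction when measured against the larger model and roughly a 70% increase when measured against the smaller one, so the reference model should always be stated alongside the percentage.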

5) ABLATION STUDY -DROPOUT REGULARISATION
Dropout regularisation is commonly used in CNNs in an attempt to reduce the possibility of the model overfitting to the training data. Overfitting causes the model to learn features specific to the training data, rather than being able to generalise to new, unseen samples. In this experiment, we used the original online repository's dropout value of 0.3, with the results shown in Tables 11 and 12.

TABLE 11. Dice coefficient and Hausdorff distance comparison between the proposed U-Net++ with and without dropout regularization on the BraTS 2019 validation dataset. Best scores in bold.

The results for this particular dropout value show that there was no substantial improvement in terms of model prediction. For this reason, we decided not to use dropout regularization going forward. In our case, experiment prioritisation is the main reason for only running a single dropout test using a value of 0.3. Thus, additional testing with other dropout values is encouraged, as it may lead others to obtain more positive results. This is also mentioned in Section VI-B.
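For readers unfamiliar with the mechanism, a minimal sketch of (inverted) dropout with the rate of 0.3 used in this experiment is given below. It is a generic NumPy illustration, not the layer used in the evaluated model; the rescaling by 1 / (1 - rate) is the common convention that keeps the expected activation unchanged between training and inference.

```python
import numpy as np


def dropout(x, rate, rng, training=True):
    """Zero each activation with probability `rate` during training and
    rescale the survivors so that the expected value is preserved."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones(100_000)
dropped = dropout(activations, rate=0.3, rng=rng)
print((dropped == 0).mean(), dropped.mean())
```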

6) POST-PROCESSING ANALYSIS
The final set of experiments relate to possibilities of improving the model's predictions after training. For every set of predictions uploaded to the CBICA BraTS web portal, a spreadsheet containing the evaluation scores for each patient is provided to the uploader. Some of the result files extracted for previous experiments showed patients with an ET Dice Coefficient score of 0, as shown in Table 13.
A thorough analysis was conducted on the patients with zero-valued ET Dice scores, elaborated further in Section VII-A below. Once the appropriate criteria for post-processing were identified, the final step was to confirm that the improved scores obtained via post-processing would not diminish any of the other scores. This test was conducted by comparing the quality of the predictions with and without zero thresholding, shown in Tables 14 and 15. As expected, the main improvement from this experiment was for the enhancing tumour category, since the post-processing pipeline was built to handle patient cases with an ET score of 0. The recorded improvements are particularly substantial for the Dice Coefficient and Hausdorff Distance, with the best results overall being obtained by the constant threshold post-processing approach, which was thus maintained for the final model.
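The idea behind constant voxel thresholding can be sketched as follows. This is a hedged illustration rather than the exact pipeline used here: the threshold value passed as `min_voxels` is hypothetical, and relabelling the removed ET voxels as label 1 (necrotic and non-enhancing tumour core) is an assumption. The label value 4 for enhancing tumour follows the BraTS convention.

```python
import numpy as np

ET_LABEL = 4           # BraTS label for enhancing tumour
REPLACEMENT_LABEL = 1  # assumption: relabel as necrotic / non-enhancing core


def threshold_enhancing_tumour(segmentation, min_voxels):
    """If fewer than `min_voxels` voxels are predicted as ET, treat them as
    false positives and relabel them; otherwise leave the prediction as-is."""
    et_mask = segmentation == ET_LABEL
    if 0 < et_mask.sum() < min_voxels:
        segmentation = segmentation.copy()
        segmentation[et_mask] = REPLACEMENT_LABEL
    return segmentation


# Toy prediction with only three ET voxels.
prediction = np.zeros((4, 4, 4), dtype=int)
prediction[1] = 2             # some edema
prediction[0, 0, :3] = 4      # a handful of spurious ET voxels
cleaned = threshold_enhancing_tumour(prediction, min_voxels=10)
print((prediction == 4).sum(), (cleaned == 4).sum())
```

When the ground truth contains no ET voxels, a single predicted ET voxel yields an ET Dice score of 0 for that patient, which is why removing very small ET predictions can raise the average score noticeably.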

7) SUMMARY OF EXPERIMENTS
This section presents all of the results obtained from the experiments performed in this paper. The harmonised results for all of the experiments discussed in this section are shown in Tables 16 and 17.
Going through each of the experiments sequentially, the data augmentation was undoubtedly one of the larger improvements applied to the proposed model. The conference models (OHBM and MELECON) exhibited slightly lower scores, mostly owing to the fact they were lesser versions of the proposed U-Net++. The binary cross-entropy Dice optimization function implemented in the original U-Net++ by [12] exhibited higher sensitivity scores, yet showed lesser Dice Coefficient and Hausdorff Distance scores when compared to the proposed approach. The convolutional block schema implemented by [12] was also not favoured over the proposed structure, as this provided only marginal improvements at the cost of 69.38% increased model parameters.

TABLE 16. Dice score and Hausdorff distance for all experiments performed for this paper. All models after the first make use of data augmentation. 'Baseline' refers to the proposed U-Net++ without post-processing. Proposed model configuration and best scores in bold.

TABLE 17. Sensitivity and specificity for all experiments performed for this paper. All models after the first make use of data augmentation. 'Baseline' refers to the proposed U-Net++ without post-processing. Proposed model configuration and best scores in bold.
The dropout experiment also showed no notable improvements to the overall segmentation performance of the model. Having said this, it could be beneficial to perform further testing with different dropout values. Finally, the post-processing experiment was successful, as the constant voxel thresholding served to improve the model's enhancing tumour segmentation without diminishing performance in other metrics. Following the observations noted in this section, as well as the prioritization of the Dice Coefficient as the main criterion for evaluation, the proposed U-Net++ maintains the data augmentation and post-processing pipelines, the multiclass Dice Coefficient optimization, and the reduced number of convolutional blocks.

B. RESULTS
The BraTS 2019 validation dataset was used to assess the model's performance. The final scores averaged over all 125 patient samples are shown in Table 18. As previously mentioned, [36] discovered that of the 2% of pathological pixels in the scan, over half were edema pixels. The scores obtained in the WT category in Table 18 reinforce this. The whole tumour obtained the highest Dice Coefficient and Sensitivity scores by a wide margin, and it is also the only target class containing the edema tumour section. This is a pattern which is observable throughout other research evaluated on the BraTS datasets. Conversely, the presence of LGG patient cases without an enhancing tumour segment and the low representation of the ET tumour section in the data may be contributors to the lower scores obtained for this class.
LGGs may also be difficult to classify for a model since they have less than 25% representation in the dataset compared to HGG subjects. Box and whisker plots for the Dice Coefficient and Hausdorff Distance are shown in Figure 9, giving a deeper look into the scores obtained on the evaluation data. Observing the Dice Coefficient results shown in Figure 9, the whole tumour is once again shown to be well represented, and with minor variance compared to the enhancing tumour and tumour core. The outliers are spread out for all three classes, with the ET segments having the most significant outliers, owing to the known cases with an ET Dice Coefficient of 0, previously discussed in Section VII-A. The median Dice scores for each class are above 0.8. Since the Dice Coefficient represents the overlap between the model predictions and ground truths, these values show that the median segmentation performance was fairly high.
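The per-class Dice scores discussed above are computed on the nested BraTS regions rather than on the raw labels. A minimal sketch is shown below, using the BraTS label convention (1: necrotic and non-enhancing tumour core, 2: peritumoral edema, 4: enhancing tumour). Returning a score of 1.0 when both the prediction and the ground truth are empty for a region is an assumption made for this example.

```python
import numpy as np

# BraTS evaluation regions expressed as sets of labels.
REGIONS = {
    "WT": (1, 2, 4),  # whole tumour: the only region that includes edema
    "TC": (1, 4),     # tumour core
    "ET": (4,),       # enhancing tumour
}


def dice(a, b):
    """Dice Coefficient between two binary masks."""
    a = a.astype(bool)
    b = b.astype(bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumption: both masks empty counts as a perfect match
    return 2.0 * np.logical_and(a, b).sum() / total


def region_dice(truth, pred):
    """Dice score for each BraTS region given two label maps."""
    return {
        name: dice(np.isin(truth, labels), np.isin(pred, labels))
        for name, labels in REGIONS.items()
    }


# Toy example: the single ET voxel is mislabelled as edema.
truth = np.array([1, 2, 4, 0])
pred = np.array([1, 2, 2, 0])
print(region_dice(truth, pred))
```

In the toy example the whole tumour is segmented perfectly while the ET score drops to 0, mirroring how a mislabelled enhancing tumour segment affects only the smaller regions.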
One observation when comparing the box plots for the Dice Coefficient and Hausdorff Distance is that the distributions for ET and WT change considerably. Nonetheless, these changes and the larger number of outliers could partially be attributed to the nature of the measure, which takes into account the 95th percentile of the segmentation error distances and is therefore driven by the largest errors. This is also substantiated by the WT and TC having long whiskers, which suggests that the range of Hausdorff values varies greatly in both cases. The interquartile ranges for both the Dice Coefficient and the Hausdorff Distance are fairly well contained, which suggests that the results are consistent across most patients.
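To illustrate why the Hausdorff Distance reacts differently from the Dice Coefficient, a simplified 95th-percentile Hausdorff Distance is sketched below. Several simplifications are assumed: all foreground voxels are used instead of extracted surfaces, voxel spacing is taken as 1 in every direction, the brute-force pairwise distance matrix is only practical for small masks, and an empty mask returns NaN. Implementations used by evaluation platforms may differ in these details.

```python
import numpy as np


def hd95(mask_a, mask_b):
    """Simplified 95th-percentile Hausdorff Distance between two binary masks."""
    points_a = np.argwhere(mask_a)
    points_b = np.argwhere(mask_b)
    if len(points_a) == 0 or len(points_b) == 0:
        return float("nan")  # assumption: undefined when a mask is empty
    # Pairwise Euclidean distances between every voxel in A and every voxel in B.
    distances = np.linalg.norm(
        points_a[:, None, :] - points_b[None, :, :], axis=-1
    )
    a_to_b = distances.min(axis=1)  # nearest B voxel for each A voxel
    b_to_a = distances.min(axis=0)  # nearest A voxel for each B voxel
    return max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95))


# A prediction that matches the truth except for one distant stray voxel.
truth = np.zeros((1, 1, 20), dtype=bool)
truth[0, 0, :5] = True
pred = truth.copy()
pred[0, 0, 19] = True
print(hd95(truth, pred))
```

A single stray voxel barely changes the overlap between the two masks, yet it can contribute a large distance, which is one reason why the outlier patterns of the two metrics need not coincide.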
Since the ground truths for the validation data are kept on CBICA IPP and not distributed to competitors, it is not possible to visualize outliers directly on MR images. Nonetheless, analysis for correlations may still be carried out from the output files produced by the IPP. Outlier observations for the Dice Coefficient and Hausdorff Distance are shown in Figure 10. The whole tumour is used as an example, as it was found to have the most non-zero outliers. Figure 10 shows that the values of Dice Coefficient outliers vary fairly proportionally with the corresponding Hausdorff Distance. The other observation from the plot is that Dice Coefficient outliers do not necessarily translate to Hausdorff Distance outliers, as only three of the Dice outlier samples were also Hausdorff outliers. As a result, we can confirm that whilst the proportion of values for both metrics is maintained, the outlier sample distribution is quite different.
The sensitivity values of the model are fairly close to the Dice Coefficient values for each class. The specificity is more complicated to draw correlations with, as the values are extremely high, with a very small standard deviation. An expected correlation is that higher sensitivity results in a lower specificity value for the particular class. We once again refer to the box plots for both the sensitivity and specificity to analyse each metric more closely, shown in Figure 11. The first observation made is that the specificity plots' interquartile ranges are near opposite of the sensitivity. It is also noteworthy however, that most of the specificity scores are between 0.99 and 1. High specificity values show that the model is very good at avoiding false positives. This aids in the assumption that the model would be capable of avoiding erroneous classification across classes. Another important question to ask with regard to false positives outside of the target classes is how the proposed system behaves when classifying healthy brains. One may make the assumption that false positives are handled well due to the high specificity. This is more difficult to assess, since the BraTS dataset does not contain any full MRI sequences of healthy brains.
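The very high specificity values can be understood from the definitions of the two metrics. The sketch below computes sensitivity and specificity from binary masks; the toy volume, in which only 2% of the voxels are pathological, is a hypothetical example chosen to mirror the class imbalance discussed earlier.

```python
import numpy as np


def sensitivity_specificity(truth, pred):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    truth = truth.astype(bool)
    pred = pred.astype(bool)
    tp = np.sum(truth & pred)
    fn = np.sum(truth & ~pred)
    tn = np.sum(~truth & ~pred)
    fp = np.sum(~truth & pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# 1000 voxels, 20 of which are tumour; the prediction finds 15 of them
# and adds 5 false positives.
truth = np.zeros(1000, dtype=bool)
truth[:20] = True
pred = np.zeros(1000, dtype=bool)
pred[:15] = True
pred[20:25] = True
print(sensitivity_specificity(truth, pred))
```

Because the true negatives vastly outnumber the false positives, the specificity stays close to 1 even when a quarter of the tumour is missed, which makes it a far less discriminative metric than sensitivity on this kind of data.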
A more visual representation of the results is shown in Figure 12. Both samples in the image were taken from a holdout sample of the BraTS 2019 training dataset (unseen during model training) to showcase the model's predictions against the expert ground truth segmentations. The image shows the segmentation of an LGG and HGG sample from the holdout set. An initial observation from the image is that the model performed the HGG segmentation more accurately than for the LGG sample. This may be a result of the data imbalance in the dataset. The most notably distinct slice is the first image from the LGG sample, where the model falsely predicted the edema as a multi-class segment. The other slices are fairly well classified in line with the BraTS ground truths. Another set of comparisons is shown in Figure 13. Figure 13 compares the model's performance on five separate patients against expert ground truths. The second and fifth samples were selected specifically as they are interesting cases. In the second scan, one may observe how the tumor structure is quite complex. This may have caused the model to overestimate the enhancing tumor regions in its predictions, although the overall shape of the pathology was maintained. The fourth sample was an LGG patient with no enhancing tumor segment. Whilst the model predicted this correctly, the non-enhancing tumor was overestimated compared to the ground truth. Otherwise, the rest of the samples shown were fairly well predicted by the model.

C. INTERNAL EVALUATION
As discussed in Section IV-D, two other models were built internally to assess the proposed approach: a standard U-Net and a residual U-Net variant. These models used the same data, training split, and hyperparameters as the final model. The results of the model comparison on the BraTS 2019 validation data are presented in Tables 19 and 20. Since the post-processing experiments were performed on the U-Net++ model, the results without post-processing are also tabulated to avoid any form of bias towards the proposed model. The main priorities for the selection of the proposed U-Net++ architecture from the tests in Section V-A were the Dice Coefficient and Hausdorff Distance. This is also shown in the internal evaluation, where the proposed model outperformed both the standard U-Net and the residual model. Interestingly, the standard U-Net model obtained the highest WT Dice score and also the highest TC Hausdorff Distance score by a very small margin. This may simply mean that whilst the standard U-Net struggled to predict the ET segments correctly compared to the proposed model, it was slightly better at identifying the edema sections.
The sensitivity scores obtained by the standard U-Net model were nonetheless the highest out of all three approaches assessed in the internal evaluation. This implies that whilst the standard U-Net's predicted tumor sections were not nearly as accurate as the proposed model, it was still better at avoiding classifying false negatives in the MR images. This raises an interesting possibility for future work, as combining both models in an ensemble-like architecture may result in an improvement over the proposed model's sensitivity score.

D. EXTERNAL EVALUATION
The model presented in this paper was also evaluated against peer-reviewed work with published results using the BraTS 2019 validation data. The selection of approaches in this table was mostly based on BraTS 2019 publicly available papers, whilst maintaining diversity in the selected approaches. The comparison of all the results is presented in Tables 21 and 22.

TABLE 21. Comparison between the Dice coefficient and Hausdorff distance of the proposed approach and some state-of-the-art approaches on the BraTS 2019 validation dataset. Values of '−' refer to unreported data. Best scores in bold.

TABLE 22. Comparison between the sensitivity and specificity of the proposed approach and some state-of-the-art approaches on the BraTS 2019 validation dataset. Best scores in bold. Some entries removed due to unreported sensitivity and specificity.

The first of the tabulated external approaches by Amian and Soltaninejad [47] makes use of a two-way pipeline. One pathway consists of a standard U-Net which takes the full resolution images as input, whilst the other uses a residual model similar to the approach by [19] on lower resolution samples. This approach was surpassed by the proposed model on all metrics barring the specificity, although this may be due to the rounding used by the authors.
The approach by Wang et al. [48] is the first of the remaining tabulated techniques which surpassed the proposed approach. The pipeline in [48] is similar to the model explored in this paper, making use of a standard U-Net. The main difference explored by [48] is the use of a smart patching strategy with the patch windows being generated depending on an offset from the brain boundaries. The patching strategy results in two separate patching cycles which are composed of brain voxels from the MR images.
The next study by Murugesan et al. [45] uses a more complex pipeline of multiresolution and multidimensional models. These networks are made up of variations of Inception Networks, Residual Inception Networks, and Dense Networks. Each of the three BraTS tumor classes was segmented using a separate ensemble of these networks, combining the networks' output using element-wise addition operations. The work by [45] also made use of a post-processing approach which removed small clusters of predicted voxels. This approach is similar in theory to the post-processing applied on our final model using constant ET voxel thresholding.
Another ensemble method was explored by Hamghalam et al. [49] who made use of a Generative Adversarial Network (GAN) to generate synthetic images from the BraTS data. These 'fake' input samples were used in combination with the data using the FLAIR, T1ce, and T2w sequence types. Three separate, fully connected networks were used to cater for each of the axial, coronal, and sagittal planes in the MR images. One factor of note in [49] is how the authors omitted the T1w BraTS samples from the experiment.
The final work shown in Tables 21 and 22 is the approach by Myronenko et al. [46]. The authors made use of a model which was very similar to the submission in [40] which finished first place in BraTS 2018, as discussed in Section III-B. As per the 2018 submission, the input patch size for this experiment was once again very large, using dimensions of 160 × 192 × 128. The 2020 submission also follows an encoder-decoder model approach, and obtained very high scores across all of the reported metrics, much like the 2018 model.

A. RESEARCH OUTCOMES AND LIMITATIONS
Reviewing the established aim and objectives of this paper, the results show that the model performs segmentation of the multiclass brain tumor segments automatically without any human intervention. Moreover, we have adapted the U-Net++ model in a unique way with a number of experiments which showcase the effects of modifications applied to the architecture and the extent of their improvements or otherwise. An example would be how the ablation study for data augmentation showed a significant improvement across all of the model metrics, and how reducing the number of convolutional blocks came with minimal disadvantages.
One should note that given the size of the models, the training time is substantial. This led to one of the main limitations of the project where the models had to be trained on Cloud instances for the increased virtual memory. This also led to personal costs as no funds were allocated for Cloud services. Nonetheless, this was mitigated slightly owing to the reduced model parameters from using only half of the convolutional blocks. Moreover, the earlier models which leveraged upsampled features from the encoder directly were runnable on local hardware, which also assisted in this regard.

B. FUTURE WORK
There are a number of opportunities to explore when attempting to improve the model's predictions. Starting with the pre-processing pipeline, one possibility would be to swap out the current cropping process with the smart patching strategy leveraged by [48]. The current method follows the aforementioned online repository. The two-phase patching strategy used in [48] could contribute to higher quality input samples for the model, also reducing the possibility of the current cropping method erroneously removing brain voxels from the input MR images. An additional test which could be performed is the inclusion of FLAIR samples in the bias-field correction step of the pre-processing pipeline and checking if this contributes positively to model training.
Moreover, additional pre-processing steps could be followed, such as using the intensity landmark normalization technique by [34], in conjunction with, or instead of the z-score normalization. The use-case for this technique is to address intensity inhomogeneities from separate medical institutions and devices. The z-score normalization itself may also be adjusted, as it is currently performed on the whole dataset, rather than on a per-patient basis. Performing the z-score normalization at an individual case level as in [50] could improve the quality of images' intensity distribution.
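The difference between dataset-wide and per-patient z-score normalization can be sketched as follows. This is an illustrative example under stated assumptions: the optional `brain_mask` argument, the function names, and the synthetic intensity ranges standing in for two different scanners are all hypothetical.

```python
import numpy as np


def zscore_per_patient(volume, brain_mask=None):
    """Normalise one volume using its own mean and standard deviation,
    optionally estimated over brain voxels only."""
    voxels = volume[brain_mask] if brain_mask is not None else volume
    return (volume - voxels.mean()) / (voxels.std() + 1e-8)


def zscore_dataset(volumes):
    """Normalise every volume with statistics pooled over the whole dataset."""
    stacked = np.concatenate([v.ravel() for v in volumes])
    mean, std = stacked.mean(), stacked.std()
    return [(v - mean) / (std + 1e-8) for v in volumes]


# Two synthetic 'patients' whose scans have very different intensity ranges.
rng = np.random.default_rng(0)
scan_a = rng.normal(100, 10, size=(8, 8, 8))
scan_b = rng.normal(1000, 50, size=(8, 8, 8))

per_patient = [zscore_per_patient(v) for v in (scan_a, scan_b)]
pooled = zscore_dataset([scan_a, scan_b])
print([round(float(v.mean()), 2) for v in per_patient])
print([round(float(v.mean()), 2) for v in pooled])
```

With pooled statistics, each scan keeps an offset that reflects its acquisition rather than its anatomy, whereas per-patient normalization centres every scan individually.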
There are also possible improvements on the post-processing side. In the current pipeline, post-processing is applied to correct predictions where scans with no ET segment falsely have some voxels classified as ET by the model. [51] explored an additional post-processing technique to tackle the opposite scenario, where scans contained an ET segment which was not predicted by the model. Following the observations in [51], it is possible that the model wrongly labelled ET voxels as peritumoral edema (label 2) in these predictions. An intensity-clustering technique is leveraged by [51] to identify and correct these cases. This could boost the ET segmentation scores of the proposed model considerably, as each of these cases has an ET score of 0 (out of 1), which lowers the average score over the 125 BraTS validation data samples.
Moving on to the model itself, ensembling is one possible approach which could provide improved results. In this case, we may refer to two different variations of ensembles. Starting with the internal models discussed in Section V-C, one could attempt to ensemble the standard U-Net model with the proposed U-Net++ adaptation, attempting to reap the benefits of both the higher sensitivity and Dice scores achieved by each model respectively. There is also the possibility to train each model either across separate folds (such as five-fold validation), or using multiple separate training runs. Combining the output of each of these models could result in improved segmentation performance.
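One simple way of combining the outputs of several models is to average their class probability maps before taking the most likely class per voxel. The sketch below illustrates this under stated assumptions: averaging (optionally weighted) is only one of several possible ensembling strategies, the function name is hypothetical, and the returned values are class indices rather than BraTS label values.

```python
import numpy as np


def ensemble_predict(probability_maps, weights=None):
    """Average per-model probability maps of shape (classes, ...) and
    return the index of the most probable class for every voxel."""
    stacked = np.stack(probability_maps, axis=0)
    mean = np.average(stacked, axis=0, weights=weights)
    return mean.argmax(axis=0)


# Two hypothetical models predicting two classes for two voxels.
model_a = np.array([[0.9, 0.4],
                    [0.1, 0.6]])
model_b = np.array([[0.6, 0.2],
                    [0.4, 0.8]])
print(ensemble_predict([model_a, model_b]))
```

The same function applies whether the probability maps come from different architectures, from models trained on separate folds, or from repeated training runs of a single architecture.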
We may go even further using the proposed model, such as having dedicated networks/paths as in [31] for the HGG and LGG samples. This latter approach could nonetheless be inconclusive given the imbalance between the HGG and LGG images in the BraTS data. The work by [49] discussed in Section V-D also proposed an interesting approach, using a separate network for each of the axial, coronal, and sagittal dimensions of the MR images.
Other improvements could be applied to the proposed model as-is, such as further testing using separate dropout values. Given the training time constraints described in Section VI-A, testing was only performed using the dropout value of 0.3. Exploring other values of dropout with the proposed model could result in reducing overfitting of the model even further, combined with the data augmentation pipeline already in place.

C. FINAL REMARKS
In this paper, we presented an automatic model for brain tumor segmentation with positive results obtained on the BraTS 2019 validation data. The modifications applied to the model architecture make it more compact, being half the size of the original U-Net++ model [12]. This project provides useful contributions to the field owing to the results obtained from the documented experiments. Earlier versions of this model were presented at peer-reviewed conferences [43], [44], which led to the more extensive research presented in this paper. This extension was especially warranted since these variants of the proposed model were evaluated on an unseen holdout sample of the BraTS 2019 training dataset.
The benefits of the proposed model stem from the complex, yet low-parameter architecture inherent to U-Net and U-Net++, now packaged in a smaller, more accessible model. The aforementioned experiments provide insight to other researchers, explaining how adjusting certain model features or improving input and output image quality using pre-processing and post-processing increased the final scores of the model. The experiments also highlight which modifications resulted in improvements in the Dice Coefficient, and those which favored sensitivity instead.