HTTU-Net: Hybrid Two Track U-Net for Automatic Brain Tumor Segmentation

Brain cancer is one of the leading causes of cancer death, and early screening is the best way to diagnose and treat brain tumors. Magnetic Resonance Imaging (MRI) is commonly used for brain tumor diagnosis; however, achieving high accuracy and performance remains a challenge for most previously presented automated medical diagnosis methods. In this paper, we propose a Hybrid Two-Track U-Net (HTTU-Net) architecture for brain tumor segmentation. This architecture uses Leaky ReLU activation and batch normalization. It includes two tracks; each one has a different number of layers and uses a different kernel size. We then merge these two tracks to generate the final segmentation. We use the focal loss and generalized Dice loss (GDL) functions to address the problem of class imbalance. The proposed segmentation method was evaluated on the BraTS'2018 dataset and obtained a mean Dice similarity coefficient of 0.865 for the whole tumor region, 0.808 for the core region, and 0.745 for the enhancing region, and a median Dice similarity coefficient of 0.883, 0.895, and 0.815 for the whole tumor, core, and enhancing regions, respectively. The proposed HTTU-Net architecture is effective for the segmentation of brain tumors and achieves highly accurate results. Other quantitative and qualitative evaluations are discussed throughout the paper. They confirm that our results are comparable to expert human-level performance and could help experts reduce diagnosis time.


I. INTRODUCTION
Brain tumor segmentation plays a crucial role in the diagnosis and planning of cancer treatment. Gliomas are the most common type of brain tumor. They can be classified as Low-Grade Gliomas (LGG) and High-Grade Gliomas (HGG). LGG tumors are less aggressive, while HGG tumors are malignant, grow rapidly, and invade surrounding tissues [1]. Patients with high-grade gliomas, even under treatment, do not live on average for more than 14 months after diagnosis [2]. Possible treatments include surgery, followed by or combined with chemotherapy and radiotherapy [3]. Specialists can use ultrasound, computed tomography (CT), or MRI for patient screening. MRI is especially helpful for glioma assessment, as extra information can be extracted. Common MRI sequences are fluid-attenuated inversion recovery (FLAIR), T1-weighted (T1), T1-weighted contrast-enhanced (T1c), and T2-weighted (T2).
The associate editor coordinating the review of this manuscript and approving it for publication was Yudong Zhang.
Manual segmentation requires a great deal of time and effort, and it is a subjective process. Automatic or semiautomatic methods are therefore necessary [4]. The main challenges of brain tumor segmentation are that tumors can occur in almost any shape and size anywhere in the brain, and that they have low contrast with the surrounding tissue. Deep learning-based techniques have outperformed traditional methods. One of the most common and effective segmentation methods is the fully convolutional network (FCN). Zhao et al. [5] proposed a technique for brain tumor segmentation that integrates an FCN with Conditional Random Fields (CRFs). Badrinarayanan et al. [6] presented SegNet, a deep convolutional model for semantic segmentation that consists of an encoder network and a corresponding decoder network followed by a pixel-wise classification layer. Recently, algorithms based on convolutional neural networks (CNNs) have achieved excellent results in image segmentation [7]. In this area, U-Net-based models [8] in particular are used extensively. The U-Net architecture is a fully convolutional model consisting of encoding and decoding parts. The U-Net model offers several advantages [9] for the segmentation process: it uses global location and context simultaneously to generate a good segmentation map, and it can provide good segmentation results from very few training samples. The contributions of this paper can be summarized as follows.
• We propose the HTTU-Net architecture to address the main challenges of brain tumor segmentation: tumors can occur in almost any shape and size anywhere in the brain, they have low contrast with the surrounding tissue, and the classes are imbalanced. HTTU-Net not only extracts more semantic information but also gives more consideration to small-scale brain tumors, which improves the segmentation of brain tumors.
• HTTU-Net builds on the success of U-Net-based architectures. Our technique is trained using multimodal image patches. It also updates the U-Net network by adding batch normalization at the end of each block to reduce shifts in the mean and variance of the activations and to stabilize the layers. In our architecture, the first track focuses on the tumor's shape and size, while the second track captures the contextual information. Each track consists of a different number of convolution blocks and uses a different kernel size to handle the different tumor sizes.
• We introduce a new hybrid loss function, combining the Focal Loss and Generalized Dice Loss functions, to mitigate the class imbalance.
• We demonstrate that the proposed strategy improves the accuracy of the original U-Net and also alleviates the issue of overfitting. We experiment with the BraTS 2018 dataset, and our architecture shows superior performance.
Section II presents a brief survey of the related work. In Section III, we describe the material and methods. Section IV shows the experimental results, and Section V presents the discussion. Finally, the conclusion and future work are provided in Section VI.

II. RELATED WORK
Several CNN-based approaches for brain tumor segmentation have been suggested in the literature, and significant progress has been made [10]. Pereira et al. [11] proposed a 2D CNN with a small kernel size (i.e., 3 × 3). They trained two distinct models, one for HGG and another for LGG. They also used max-pooling layers with a stride of two and applied dropout to the dense layers only. The model uses the leaky rectified linear unit (LeakyReLU) activation function [12]. Havaei et al. [13] implemented a two-stage cascade network that contains a local path and a global path. This architecture can encode contextual characteristics both locally and globally. Zhao and Jia [14] presented a multiscale CNN model, through which not only are local and global features learned, but complementary information from the various MRI modalities is also combined.
Le and Pham [15] suggested a fully convolutional U-Net that acquires features from a multimodal MRI training dataset and then applies Extremely Randomized Trees (Extra-Trees) to segment the abnormal tumor cells. Wang et al. [16] used three binary classifiers, one for each task, to develop a triple-cascaded brain tumor segmentation architecture. Consequently, they can decrease over-fitting and make the model simple to train. However, one limitation of the cascade architecture is that it is not end-to-end, and it takes a long time to train. Dong et al. [17] established a deep convolutional network based on U-Net for automatic tumor segmentation, which was evaluated on the BraTS'2015 dataset. This approach uses the Dice loss function. In [18], Chen proposed a separable 3D U-Net architecture using the prevalent U-Net structure, dividing each 3D convolution into three parallel branches. He and Fang [19] proposed a three-pathway U-Net structure to segment the brain tumor; each modality is processed in a single pathway. Table 1 summarizes the datasets, methods, and results of the related works on deep learning-based brain tumor segmentation that are most similar to our work, where ET, WT, and TC denote Enhancing Tumor, Whole Tumor, and Tumor Core, respectively.

III. MATERIAL AND METHODS
A. DATASET AND DATA PREPARATION
1) DATASET
The BraTS'2018 [20] contest provides a large training set of 210 HGG MRI scans and 75 LGG MRI scans. The size of each MRI scan is 240 × 240 × 155, and each case has FLAIR, T1, contrast-enhanced T1 (T1c), and T2 volumes. The dataset is co-registered, re-sampled to 1 mm³, and skull-stripped. The N4ITK bias correction method [21] is applied to the MRI volumes to alleviate non-homogeneity and intensity variations. Each slice is then normalized using the slice's mean and standard deviation. In this work, to reduce the impact of the class imbalance issue, we perform data augmentation that comprises rotation, translation, and horizontal and vertical flipping [22]. During training, 2D patches of size 128 × 128 × 4 are sampled randomly to decrease computational time, and all zero-intensity patches are removed from the training set [13].
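The slice normalization and patch sampling described above can be illustrated with a minimal sketch. It assumes a single-modality slice stored as a list of rows (the paper stacks four modalities), and the helper names `normalize_slice`, `crop`, and `sample_patches` are hypothetical, not taken from the authors' code. The zero-intensity check is applied to the raw intensities, since background pixels are no longer zero after normalization.

```python
import random
import statistics


def normalize_slice(slice_2d):
    """Z-score a 2D slice (list of rows) with its own mean and standard deviation."""
    values = [v for row in slice_2d for v in row]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant slice carries no contrast; map it to zeros.
        return [[0.0 for _ in row] for row in slice_2d]
    return [[(v - mean) / std for v in row] for row in slice_2d]


def crop(slice_2d, top, left, size):
    """Cut a size x size window whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in slice_2d[top:top + size]]


def sample_patches(raw_slice, patch_size, count, rng, max_tries=1000):
    """Randomly sample normalized patches, skipping windows that are all zero."""
    normalized = normalize_slice(raw_slice)
    height, width = len(raw_slice), len(raw_slice[0])
    patches = []
    for _ in range(max_tries):
        if len(patches) == count:
            break
        top = rng.randrange(height - patch_size + 1)
        left = rng.randrange(width - patch_size + 1)
        raw_patch = crop(raw_slice, top, left, patch_size)
        if all(v == 0 for row in raw_patch for v in row):
            continue  # zero-intensity patch: discard, as in the text
        patches.append(crop(normalized, top, left, patch_size))
    return patches
```

In the setting of the paper, `patch_size` would be 128 and the same window would be cut from all four modalities.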

1) U-NET BASED DEEP CONVOLUTIONAL NETWORKS
We will briefly explain the original U-Net framework in this subsection. The U-Net architecture can be regarded as an auto-encoder where there are several contraction blocks in the contraction part. Each block includes two 3 × 3 convolutional layers followed by a 2 × 2 pooling layer, and it uses the rectified linear unit (ReLU) activation function. It doubles the number of feature maps from one block to the next to be able to effectively learn the complex structures, increasing the number of feature maps from 1 to 1024. Each block in the expanding part consists of two 3 × 3 convolutional layers followed by a 2 × 2 up-sampling layer. In order to maintain symmetry, the number of feature maps is reduced by half from each block to the next. There is a collection of stacked convolution/pooling layers in the contracting or downsampling part, whereas the expansion part consists of up/transposed convolutional layers. Skip connections are used in U-Net to append the feature maps of the corresponding contraction layer at the end of each expansion block. These connections guarantee that the contracting part features will be used in image reconstruction, and the final layer is a 1 × 1 convolutional layer to produce the segmentation results.

2) TWO-TRACK U-NET ARCHITECTURE
Severe class imbalance is one of the main problems of brain tumor segmentation: healthy voxels account for 98% of the total voxels, while necrosis, edema, and non-enhanced tumor represent 0.18%, 1.1%, and 0.38%, respectively. To address this problem, we designed an improved U-Net architecture consisting of two distinct tracks. These two tracks use small and large convolutional kernels and follow the same encoder/decoder structure. We refer to them as the first track and the second track. The first track uses 3 × 3 convolutional kernels and has five convolutional blocks, while the second track uses 5 × 5 convolutional kernels and has four convolutional blocks. The proposed two-track architecture improves on the original U-Net architecture. We apply batch normalization after each convolution block to keep gradient magnitudes under control, accelerate convergence, and reduce the effect of internal covariate shift, so that the network parameters do not change rapidly during backpropagation. The Leaky Rectified Linear Unit (LeakyReLU) activation function with a leakage factor of 0.1 is used in this architecture. Fig. 2 shows the proposed architecture. The motivation behind the design of this model is to improve the efficiency of the original U-Net model, process different tumor sizes, and reduce the issue of class imbalance. The details of the two tracks are described in the following subsections, and their parameters are shown in Table 2.
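As a small illustration of the two operations named above, the following sketch applies LeakyReLU with the stated leakage factor of 0.1 and a batch normalization step to a flat list of activations. The function names and the `eps`, `gamma`, and `beta` defaults are assumptions made for the example; a real layer would learn `gamma` and `beta` and normalize per channel.

```python
import math


def leaky_relu(x, alpha=0.1):
    """LeakyReLU: identity for positive inputs, slope alpha for negative ones."""
    return x if x > 0 else alpha * x


def batch_norm(values, eps=1e-5, gamma=1.0, beta=0.0):
    """Normalize a batch of activations to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# Activation followed by normalization, mirroring the order described in the text.
activations = [leaky_relu(v) for v in [-2.0, -1.0, 0.5, 3.0]]
normalized = batch_norm(activations)
```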

a: THE FIRST TRACK
The first track's contracting part consists of 5 convolutional blocks. Every block has two convolutional layers. We use the LeakyReLU activation function in the proposed architecture and carry out batch normalization at the end of each block. This track uses a 3 × 3 kernel for all convolutional layers. The numbers of filters for the first, second, third, fourth, and fifth blocks are 64, 128, 256, 512, and 1024, respectively. At the end of each block, a max-pooling layer is used to reduce the feature map size by half. Similarly, the expanding part consists of 5 blocks; each block starts with a deconvolutional layer with a kernel size of 3 × 3 and a stride of 2 that doubles the size of the feature maps, thus increasing it from 8 × 8 to 128 × 128. Table 2 shows the parameters for each layer.
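The feature map sizes quoted above can be checked with a short calculation. The sketch below assumes that 2 × 2 pooling halves the spatial size between consecutive blocks but not after the last (bottleneck) block; this is an assumption chosen because it reproduces the stated 8 × 8 bottleneck for a 128 × 128 input. The name `track_layout` is hypothetical.

```python
def track_layout(input_size, filters):
    """Return (number_of_filters, spatial_size) for each contracting block.

    Assumes the spatial size is halved between consecutive blocks and left
    unchanged after the final block.
    """
    layout = []
    size = input_size
    for index, n_filters in enumerate(filters):
        layout.append((n_filters, size))
        if index < len(filters) - 1:
            size //= 2  # 2 x 2 max pooling
    return layout


first_track = track_layout(128, [64, 128, 256, 512, 1024])
for n_filters, size in first_track:
    print(f"{n_filters:5d} filters at {size} x {size}")
```

The same function applied to four blocks with 64, 128, 256, and 512 filters ends at 16 × 16.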

b: THE SECOND TRACK
The second track's contracting part consists of 4 convolutional blocks. Each block has two convolutional layers followed by batch normalization (BN). We use the LeakyReLU activation function and a 5 × 5 kernel for all layers in this track. The numbers of filters for the four blocks are 64, 128, 256, and 512, respectively. Similarly, the expanding part includes 4 convolutional blocks, and each block begins with a deconvolutional layer with a kernel size of 5 × 5 and a stride of 2 that doubles the size of the feature maps. The size of the feature maps increases from 16 × 16 to 128 × 128. Finally, we concatenate the output of the first track with the output of the second track; this concatenation is convolved and then followed by a soft-max function.
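The final merging step can be sketched for a single pixel. The text does not state the kernel size of the convolution applied after concatenation, so the example assumes a 1 × 1 convolution, which reduces to a per-pixel linear map; the names `fuse_pixel`, `weights`, and `biases` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable soft-max over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_pixel(track1_features, track2_features, weights, biases):
    """Concatenate the two tracks' features at one pixel, apply an assumed
    1 x 1 convolution (one weight row per class), then soft-max."""
    merged = list(track1_features) + list(track2_features)
    logits = [
        sum(w * f for w, f in zip(row, merged)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Two classes, two features from the first track and one from the second.
probabilities = fuse_pixel(
    [2.0, 0.0], [0.0],
    weights=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    biases=[0.0, 0.0],
)
```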

C. MODEL TRAINING
Multimodal image patches of size 128 × 128 × 4 are used to train the proposed model. A batch size of 4 is used, and the stochastic gradient descent (SGD) [23] optimization algorithm is used for parameter optimization.
We set the momentum to 0.8. The initial learning rate of 0.0001 is decayed exponentially with a decay factor of 0.1. We trained the model for 60 epochs because the validation loss did not improve after that. Because the loss function plays an essential role in improving the model accuracy, we implemented a hybrid loss function, as described in the following section.
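To make the optimizer settings concrete, the sketch below shows one common form of exponential learning-rate decay and a momentum SGD update. The text gives the initial rate (0.0001), the decay factor (0.1), and the momentum (0.8), but not the interval over which the factor is applied; `decay_epochs=60` is therefore an assumption for illustration only, as are the function names.

```python
def learning_rate(epoch, initial_lr=1e-4, decay_rate=0.1, decay_epochs=60):
    """Exponential decay: the rate shrinks by decay_rate every decay_epochs.

    decay_epochs is an assumed value; the paper does not state it.
    """
    return initial_lr * decay_rate ** (epoch / decay_epochs)


def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.8):
    """One SGD update with classical momentum.

    velocity <- momentum * velocity - lr * grad
    weights  <- weights + velocity
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


weights, velocity = [1.0, -0.5], [0.0, 0.0]
for epoch in range(3):
    grads = [2.0 * w for w in weights]  # gradient of sum(w^2), a toy objective
    weights, velocity = sgd_momentum_step(
        weights, grads, velocity, learning_rate(epoch)
    )
```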

1) HYBRID LOSS
This work illustrates how important the choice of the loss function can be in a deep learning system, particularly when dealing with highly unbalanced problems. The choice of loss function also affects model accuracy. Accordingly, we use a hybrid loss function that combines the focal loss function and the Generalized Dice Loss (GDL). The most robust behavior across setups was observed when using GDL. The focal loss function is a modified version of binary cross-entropy that focuses training on low-confidence labels and also achieves state-of-the-art accuracy and speed.

a: THE GENERALIZED DICE SCORE (GDS)
The generalized Dice score was proposed in [24] as a method for evaluating multi-class segmentation. It is given in equation (1), where L is the total number of labels, k denotes the batch size, and $W_i$ represents the weight assigned to the $i$-th label. As proposed in [21], we set $W_i = 1 / \sum_k g_{ik}$. Here, $p_{ik}$ and $g_{ik}$ represent the value of the $(i, k)$-th pixel of the segmented binary image and of the binary ground truth image, respectively.
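A minimal sketch of the generalized Dice loss follows, assuming equation (1) has the usual form $\mathrm{GDL} = 1 - 2\,\frac{\sum_{i=1}^{L} W_i \sum_k g_{ik} p_{ik}}{\sum_{i=1}^{L} W_i \sum_k (g_{ik} + p_{ik})}$. The weight defaults to the inverse label volume; the `power` argument is an assumption added so that the inverse squared volume, a common choice for generalized Dice, can also be tried. The function name and the `eps` smoothing term are hypothetical.

```python
def generalized_dice_loss(truth, pred, power=1, eps=1e-7):
    """Generalized Dice loss over L labels.

    truth[i][k] is 1 if pixel k belongs to label i, else 0.
    pred[i][k]  is the predicted probability of label i at pixel k.
    Each label is weighted by 1 / (label volume ** power).
    """
    numerator = 0.0
    denominator = 0.0
    for g_i, p_i in zip(truth, pred):
        volume = sum(g_i)
        # Labels absent from the batch get a very large weight here;
        # eps only prevents division by zero.
        w_i = 1.0 / (volume ** power + eps)
        numerator += w_i * sum(g * p for g, p in zip(g_i, p_i))
        denominator += w_i * sum(g + p for g, p in zip(g_i, p_i))
    return 1.0 - 2.0 * numerator / (denominator + eps)


truth = [[1, 1, 0, 0], [0, 0, 1, 1]]
perfect = generalized_dice_loss(truth, truth)           # close to 0
inverted = generalized_dice_loss(truth, truth[::-1])    # no overlap, so 1
```

Because each label's contribution is scaled by the inverse of its volume, small structures such as the enhancing tumor are not dominated by the large background class.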

b: THE FOCAL LOSS
For the segmentation task, we select a multiclass focal loss [25], as shown in equation (2), where $(1 - q^{k}_{x,y,z})^{\gamma}$ is a modulating factor. The focusing parameter $\gamma$ smoothly adjusts the rate at which easy examples are down-weighted. Setting $\gamma > 0$ reduces the relative loss for well-classified samples and puts the focus on hard, misclassified samples. In contrast, the focal loss is equal to the original cross-entropy loss when $\gamma = 0$ (we found $\gamma = 2$ to work best in our experiments). Here, p is the model's estimated probability for the class. The proposed hybrid loss (HL) function is shown in equation (3).
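The behavior of the focusing parameter can be illustrated with a short sketch. It assumes the usual multiclass form in which each pixel contributes $-(1 - p)^{\gamma} \log p$ for the probability $p$ assigned to its true class, averaged over pixels. How the two terms are combined in equation (3) is not recoverable from the text, so `hybrid_loss` below is an assumed weighted sum with a hypothetical `focal_weight` parameter.

```python
import math


def focal_loss(truth, pred, gamma=2.0, eps=1e-7):
    """Multiclass focal loss averaged over pixels.

    truth[i][k] is the one-hot ground truth for label i at pixel k,
    pred[i][k] the predicted probability. With gamma = 0 this reduces
    to the ordinary cross-entropy.
    """
    n_pixels = len(truth[0])
    total = 0.0
    for g_i, p_i in zip(truth, pred):
        for g, p in zip(g_i, p_i):
            if g:
                p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
                total += -((1.0 - p) ** gamma) * math.log(p)
    return total / n_pixels


def hybrid_loss(dice_term, focal_term, focal_weight=1.0):
    """Assumed combination: generalized Dice loss plus weighted focal loss."""
    return dice_term + focal_weight * focal_term


truth = [[1, 0], [0, 1]]
pred = [[0.8, 0.3], [0.2, 0.7]]
cross_entropy = focal_loss(truth, pred, gamma=0.0)
focused = focal_loss(truth, pred, gamma=2.0)  # smaller: easy pixels are down-weighted
```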

IV. EXPERIMENTAL RESULTS
A. EVALUATION METRICS
We evaluate the segmentation results using four metrics: Dice similarity coefficient (DSC), Hausdorff distance (HD), Sensitivity, and Specificity. The Dice similarity score measures the similarity between the ground truth and the resulting segmentation mask. It is defined in equation (4), where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. Sensitivity is the true positive rate, as in equation (5). Specificity is the true negative rate, as in equation (6).
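These three overlap metrics can be computed from the confusion counts of two binary masks, using the standard definitions $\mathrm{DSC} = 2TP / (2TP + FP + FN)$, $\mathrm{Sensitivity} = TP / (TP + FN)$, and $\mathrm{Specificity} = TN / (TN + FP)$. The sketch below works on flattened masks; the function names, and the convention of returning 1.0 when a denominator is zero, are choices made for the example.

```python
def confusion_counts(truth, pred):
    """Count TP, FP, FN, TN between two flattened binary masks."""
    tp = sum(1 for g, p in zip(truth, pred) if g and p)
    fp = sum(1 for g, p in zip(truth, pred) if not g and p)
    fn = sum(1 for g, p in zip(truth, pred) if g and not p)
    tn = sum(1 for g, p in zip(truth, pred) if not g and not p)
    return tp, fp, fn, tn


def dice(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # both masks empty: perfect match


def sensitivity(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 1.0


def specificity(tn, fp):
    return tn / (tn + fp) if (tn + fp) else 1.0


truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(truth, pred)
scores = (dice(tp, fp, fn), sensitivity(tp, fn), specificity(tn, fp))
```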
Hausdorff distance indicates the quality of the tumor border segmentation. It measures the maximum distance between two surfaces, A and B, as in equation (7), where a and b are points in A and B, respectively, and d(a, b) is the Euclidean distance between these points [26].
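A brute-force sketch of the symmetric Hausdorff distance between two point sets is given below, assuming the usual definition: the larger of the two directed distances, where each directed distance is the greatest distance from a point in one set to its nearest point in the other. The function names are hypothetical, and the quadratic cost makes this suitable only for small boundary sets.

```python
import math


def directed_hausdorff(points_a, points_b):
    """Largest distance from any point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in points_b) for a in points_a)


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(
        directed_hausdorff(points_a, points_b),
        directed_hausdorff(points_b, points_a),
    )


surface_a = [(0, 0), (1, 0)]
surface_b = [(0, 0), (4, 0)]
distance = hausdorff(surface_a, surface_b)
```

The two directed distances generally differ (1 and 3 in this example), which is why the maximum of both is taken.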

B. PERFORMANCE ON BraTS'2018 TRAINING DATASET
In our experiments, 160 subjects from the BraTS training dataset are used for training and 40 subjects for validation. We extract 25,000 multimodal patches from each case to form a training set of 4,000,000 patches. We carry out our experiments using the Keras framework with a TensorFlow backend [27]. The training was carried out on an Intel Core i7 3.5 GHz machine with an NVIDIA GeForce GTX 1070. The segmentation results for eight cases are shown in Figures 3 and 4. The Dice similarity and Sensitivity metrics are presented in Table 3, and the Specificity and Hausdorff distance in Table 4. The quantitative results presented in this section indicate that the proposed HTTU-Net model produces accurate segmentation results. It achieved a Dice similarity coefficient of 0.852, 0.812, and 0.741 for the segmentation of the whole tumor, core tumor, and enhancing tumor, respectively.

C. PERFORMANCE ON BraTS'2018 TESTING
Sixty subjects from the BraTS 2018 dataset are used for testing. We present the Dice similarity and Sensitivity metrics for these cases in Table 5 and the Specificity and Hausdorff distance in Table 6. In Table 5, the method obtained mean ET, WT, and TC Dice scores of 0.745, 0.865, and 0.808, respectively.
In Table 6, average HD scores of 4.43, 7.53, and 8.81 for ET, WT, and TC, respectively, were obtained. The evaluation of our algorithm's performance on the BraTS 2018 validation set is presented in Fig. 5. We can observe that performance is consistent across the training and testing cases, indicating that this model generalizes well to unseen examples. Table 7 shows a comparison between the performance of the proposed HTTU-Net model, the original U-Net architecture, and the first-track and second-track models. Boxplots of the Dice and Sensitivity dispersion are shown in Fig. 6, and boxplots of the Specificity and HD dispersion are shown in Fig. 7. The boxplots report the minimum, lower quartile, median, upper quartile, and maximum. Points that fall outside the interquartile range are considered outliers. It is evident from the boxplots that in most cases our algorithm produces good segmentation accuracy.

V. DISCUSSION
The precise segmentation of gliomas has drawn considerable interest from medical doctors and researchers as a critical component of tumor detection and treatment planning. Since manual segmentation of tumor regions is exhausting and time-consuming, it is important to develop effective computational methods for automatic brain tumor segmentation. Accordingly, most deep learning methods proposed for brain tumor segmentation aim to solve the class imbalance problem. The proposed method provides more reliable segmentation output than most previous methods. The results in Fig. 3 and Fig. 4 show that, although the size, shape, location, and intensity of the tumors in these eight samples differ, the method segments them well and also enhances the segmentation performance for small tumor regions. Generally, the results of the proposed segmentation architecture are comparable to those produced by the experts (GT). From Table 5 and Table 6, it is also observed that the small improvements on the testing cases drawn from the BraTS 2018 training set are due to the fact that it contains more HGG tumor samples than LGG tumor samples. In Table 7, we use two tracks with small and large convolutional kernel sizes to investigate the impact of different parameters; the result reveals that the HTTU-Net model achieves the best performance among all compared models. It improves by 0.055 for ET, 0.013 for WT, and 0.014 for TC over the original U-Net; by 0.006 for ET, 0.015 for WT, and 0.008 for TC over the first-track model; and by 0.013 for ET, 0.006 for WT, and 0.016 for TC over the second-track model. Because of the multi-modality nature of MRI and the memory limitations of current GPUs, the training stage of the proposed method is time-consuming, which could be considered a limitation. The training time for one epoch is around 10 min, whereas the prediction phase processes the testing dataset rapidly.

VI. CONCLUSION AND FUTURE WORK
In this paper, we introduced an automatic approach for brain tumor segmentation using a 2D HTTU-Net architecture. The proposed technique has been quantitatively evaluated using the BraTS'2018 dataset. The architecture contains two tracks; each one consists of a different number of convolution blocks and uses a different kernel size to handle the different tumor sizes. We also developed a new hybrid loss function that combines the focal loss and Generalized Dice Loss functions to alleviate the class imbalance problem. The HTTU-Net architecture achieves higher performance and addresses the difficulty that brain tumors can occur anywhere in the brain, in almost any shape and size. The evaluation of the proposed approach verifies that our results are comparable to those obtained manually by experts. Future work may concentrate on a 3D HTTU-Net.
VOLUME 8, 2020