A Modified U-Net Based Framework for Automated Segmentation of Hippocampus Region in Brain MRI

Accurate localization of anatomical brain structures is of great concern in many biomedical applications, as it underpins correct and reliable diagnostic strategies. Manual and semi-automated delineation methods used for this purpose are time consuming. To address this problem, we present an enhanced model for automated segmentation of two neighboring small structures of the brain in the hippocampus region, i.e., the anterior and posterior hippocampus. Our aim is to improve segmentation performance: the proposed architecture captures contextual information in the encoding path and enables precise localization through a symmetric decoding path. In particular, our methodology enhances the original U-Net architecture with 3-dimensional (3D) data processing and employs spatial elastic deformation. For comparison, we also evaluated the segmentation performance of a recursive U-Net. The effectiveness of different optimization strategies is evaluated on a publicly available dataset comprising 3D mono-modal magnetic resonance imaging volumes of the hippocampus region. Our experimental results demonstrate the robustness of the proposed model when combined with a patch-based augmentation technique for hippocampal segmentation.


I. INTRODUCTION
The hippocampus is a small archi-cortical brain structure that manages short-term anecdotal and critical memory while depositing it into long-term memory. The hippocampus region is also responsible for vocal and musical emotions, as it forms part of the temporal limbic system. In the plain interpretation of voices and musical emotions, the amygdala is particularly involved, and it enables the hippocampus to process even more complex information. Further, the hippocampus contributes towards decoding heterogeneous emotions related to music, thereby creating an alliance between memory and contextual information [1]. The human hippocampus can be described as a folded component of archi-cortex tissue, which is continuous with the neo-cortex [2]. The human brain contains two hippocampi, shaped like seahorses and commonly termed the left and right hippocampi. They are also termed the anterior and posterior hippocampus sub-fields. From an imaging perspective, the region shows very little contrast on structural magnetic resonance imaging (MRI) scans, owing to nearby anatomical structures such as the thalamus, caudate nucleus, and amygdala. These structures have intensity levels in MRI scans similar to those of the hippocampus region [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Seifedine Kadry.
In neuroimaging, segmentation of the hippocampus region plays a vital role in the early prognosis of certain brain-related abnormalities. The gray matter tissue of the temporal cortex, the hippocampus, is known to be primarily affected in the very initial stages of Alzheimer's disease (AD). This can progress to cognitive decline with increasing age. Segmentation of the hippocampus region is therefore significant for the diagnosis of AD, as it is the most affected part of the human brain. To this end, a noticeable reduction in the hippocampal volume (HV) is a marker for AD diagnosis. Other specific abnormalities, such as tau pathology or β-amyloid, initially appear in the hippocampus region in the pre-clinical stages of AD [4]. Such significant loss in the volume of the hippocampus region during the developmental phases of AD corresponds to mental, cerebral, and emotional decline [5]. Cognitive tests have also found correlates linking short-term memory (STM) loss to AD [8]. According to a recent study, the accuracy of hippocampal volumetry via differential diagnostic techniques in normal subjects and MCI/AD patients improves considerably when the hippocampal volume is normalized by the total intracranial volume (TIV) [9].
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Over the past few years, medical image analysis has become one of the most popular and active research fields in the machine learning domain [10]. There is a fair chance that it will be the area where patients first interact comfortably with a completely functioning, practical artificial intelligence system in the near future [11]. Artificial intelligence (AI) provides great opportunities to assist radiologists in diagnosing and streamlining complicated patterns within images, interpreting physiological characteristics into genetics, predicting clinical treatment results, and developing strategies for prognostic planning [12]. The data obtained using structural brain MRI could contribute to various medical applications when analyzed using deep learning methods. In particular, convolutional neural networks (CNNs) have shown remarkably robust results in the segmentation of brain tissues, tumors, and lesions [13]. A single, extensively trained CNN-based model can segment different tissues in MR brain images. This could be extended to other regions such as the skeletal muscle in chest MRIs and the blood circulatory arteries in cardiac computed tomography (CT) angiograms. CNN-based models have thus performed robustly in tissue classification and in the visualisation of physical structures of the human body [14]. A distinguishing feature of CNN-based deep learning models is that they provide an overall solution that requires minimal feature extraction while offering greater generalization and explainability [16]. Unlike conventional classification techniques, they are also capable of object-oriented classification, determining the features that characterize entire image objects.
In many biomedical image segmentation tasks, the desired output must also include localization, i.e., a class label must be assigned to each voxel. One of the challenges is access to labelled training images in large quantities. In usual clinical practice, however, delineation is still done either manually by subjective analysis or with the aid of some underlying electromedical devices. Many factors contribute to this gap between scientific research and its practical applicability in clinical routines, chiefly the complications of implementing these methods for practical use. Computer aided diagnosis (CAD) based models depend primarily on precise parameter calibration, which is essential in various medical applications. In addition, the facility to maintain and register multiple atlases is usually not available to healthcare professionals. Hence, these models cannot practically be applied to complex clinical applications. Moreover, earlier multi-atlas techniques took considerable time to segment a single image, and recent research has therefore focused on minimizing the computation time of segmentation [17]. Accurate performance evaluation of automated hippocampus segmentation methods poses another obstacle to practical implementation. In most cases, too few validation volumes are compared against a reference segmentation, because obtaining manual segmentations for a large set of volumes is difficult and time consuming. Hence, only a limited number of algorithms have so far been tested on biomedical equipment and under different acquisition scenarios [18].

A. OUR CONTRIBUTIONS
Herein, we focus on hippocampi segmentation in brain MRIs, where our deep learning network is derived from the well-known U-Net architecture proposed by Ronneberger et al. (a standard benchmark algorithm in the medical segmentation domain) [19]. We also compare different training strategies for precisely locating and segmenting the actual regions of interest by classifying each voxel in an MR image as non-hippocampus (background or surrounding), left hippocampus, or right hippocampus. We used an object-based approach to annotate each image pixel with a class label. Our framework includes patch-based segmentation along with a dense data augmentation technique, so that during the training phase the input images can be viewed at multiple scales [16]. This mitigates over-fitting and lowers the error rate, thereby enhancing the performance of our proposed model. Our main contributions are:
• We present a complete in-house framework (built in PyTorch) for extensive data augmentation, including mirroring, cropping, and spatial elastic transformation of the input data, and make it compatible with the proposed deep network model.
• We train our models from scratch with an optimization strategy chosen for robustness, using a combination of weighted multi-class Dice loss and cross-entropy loss. Further, we use the overlap-tile method for test data segmentation and evaluate each hippocampus sub-region using six different quality metrics.
• We achieve high accuracy, precision, and recall in hippocampus segmentation (Table 1).
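The combined loss named in the second contribution can be sketched as follows. This is a minimal NumPy illustration and not the authors' PyTorch implementation: the exact weighting scheme, the smoothing constant `eps`, and the unweighted sum of the two terms are assumptions, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def dice_ce_loss(logits, target, class_weights, eps=1e-6):
    """Weighted multi-class soft Dice loss plus cross-entropy (assumed form).

    logits: (C, N) array of raw scores for C classes and N voxels.
    target: (N,) integer class labels in [0, C).
    class_weights: length-C weights for the Dice term (assumption).
    """
    target = np.asarray(target, dtype=np.int64)
    n_classes = logits.shape[0]
    probs = softmax(logits, axis=0)
    one_hot = np.eye(n_classes)[target].T              # (C, N)

    # Soft Dice per class, then a weighted average over classes.
    intersection = (probs * one_hot).sum(axis=1)
    denom = probs.sum(axis=1) + one_hot.sum(axis=1)
    dice = (2.0 * intersection + eps) / (denom + eps)
    w = np.asarray(class_weights, dtype=float)
    dice_loss = 1.0 - (w * dice).sum() / w.sum()

    # Cross-entropy on the probability assigned to the true class.
    p_true = probs[target, np.arange(target.size)]
    ce = -np.log(np.clip(p_true, eps, 1.0)).mean()
    return dice_loss + ce
```

A lower weight on the background class is one common way to keep the small hippocampal classes from being dominated by the many background voxels; whether the authors weight the classes this way is not stated in the text.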

B. RELATED WORK
Automated segmentation of the hippocampus region has gained considerable significance in the scientific and research community. The approaches proposed so far can be categorized as atlas-based [20], machine learning-based [21], active-contour models [22], [23], and deep learning frameworks. Popular deep learning algorithms include deep Boltzmann machines, CNNs, stacked auto-encoders, and deep neural networks [24]. Atlas-based registration techniques have gained a significant level of popularity. First, several atlas images are registered to the unknown image, usually in a non-linear fashion. The manual segmentation masks are then transformed accordingly, which produces an individual output segmentation for each of the atlases used. Once the individual segmentation results are obtained, the final output is produced by combining them with a fusion algorithm [25], such as majority or average voting [26], global [27] or local weights [28], joint label fusion [29], or accuracy maps [30]. An active contour model (ACM) evolves according to the intensities of the input image, and its overall performance depends mainly on how clear the edge boundaries of the segmented object are.
To generate efficient segmentation results, ACMs mostly rely on prior information about the shape of the structure for better evaluation of the contours [31]. In [23], the model used principal component analysis (PCA) to analyse the shapes of both the hippocampus and the neighboring structures, along with manually segmented atlases. A framework fusing multiple atlases with an ACM, called 3D optimal local maps (OLMs), was presented in [32]. It was developed with enhanced multi-atlas concepts at the voxel level, which locally control the impact of each energy term of a hybrid ACM. Alongside these conventional methods, different ML techniques for segmentation have been adopted [33], [34]. In these methods, hand-crafted features were usually extracted from each training sample to build the training data, and classifiers were then optimized to generate the desired segmentation masks. In some of these methods, dictionaries and classifiers are learned simultaneously from a set of brain atlases and can then be used for the reconstruction and segmentation of an unseen target image. The segmentation accuracy of such methods could be improved if more complicated classifiers were learned rather than linear classifiers [35]. With the rapid evolution of deep learning (DL) techniques for medical image analysis, convolutional neural networks are becoming the methodology of choice [36] for tasks such as the segmentation of brain tissue [37], tumors [38], and lesions [39], and anatomical structure segmentation of the striatum [40] and of the caudate and thalamus [41], [42]. Deep CNN-based segmentation of the sub-cortical nuclei of the brain (including the hippocampus) has also been proposed in [43] and [44]. Additionally, CNNs have shown remarkable performance for brain extraction [45] and full brain segmentation [46], [47]. It should be noted that various software platforms are available for certain specific segmentation problems.
For brain segmentation, examples include 'volBrain' by NITRC (Neuro Imaging Tools and Research Collaboratory) [48] and 3D Slicer [49]. However, it has not been possible so far to achieve the reported high levels of segmentation accuracy in actual clinical practice. Further, variations in the anatomical definitions as well as protocol-specific issues could not be comprehensively addressed. In other words, we argue that such automatic methods may perform well on a user's own data, yet their results may differ from those reported by the software developer because the ground truth definitions differ. Moreover, recent automatic segmentation techniques incorporate expert prior knowledge of lesion appearance, anatomical shape, and other sophisticated high-level features as model parameters. These methods are based on specific datasets that differ from the actual radiology imaging data acquired at various hospitals [50].

II. PROPOSED METHODOLOGY
In MR images, the exact localization of the hippocampal structure is critical for correct diagnosis and treatment planning. Moreover, manual or semi-automatic segmentation of the hippocampus region in 3D images is exhausting, and such methods have low reproducibility in clinical routines. Although deep learning-based algorithms for automated segmentation have proven to be efficient and robust in the medical field [51], there is still room to evaluate their performance extensively on hippocampi segmentation tasks. Owing to the very low intensity contrast of the hippocampi with respect to the surrounding regions of the brain, the implementation of such frameworks remains a challenging task [18]. Towards this, we propose a deep learning-based method (shown in Figure 1), with details in the following subsections.

A. DATA PRE-PROCESSING
The pre-processing steps included data normalization, padding, and splitting. A cropped image representing the hippocampal brain region was used. To obtain it, a bounding box was created over the region of interest (ROI) in the training volumes and in their corresponding labels. The two resulting arrays were combined to form the cropped images and their labels, which were then passed to the data normalization step. We normalized each volume modality of the cropped image region independently by subtracting its mean and dividing by its standard deviation.
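The cropping and normalization steps described above can be sketched roughly as follows. The sketch assumes that the bounding box is taken around the non-zero label voxels; the text does not say how the box is derived, so this choice, the optional `margin` argument, and the function names are illustrative assumptions.

```python
import numpy as np

def crop_to_label_bbox(volume, label, margin=0):
    """Crop a volume and its label map to the bounding box of the labelled ROI.

    Assumption: the ROI is the set of voxels with a non-zero label.
    """
    coords = np.argwhere(label > 0)
    if coords.size == 0:
        return volume, label                      # nothing labelled: leave as is
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, label.shape)
    region = tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))
    return volume[region], label[region]

def zscore_normalize(volume, eps=1e-8):
    """Subtract the volume mean and divide by its standard deviation."""
    volume = volume.astype(np.float64)
    normalized = (volume - volume.mean()) / (volume.std() + eps)
    return normalized.astype(np.float32)
```

For multi-modal data the same normalization would be applied to each modality separately; the dataset used here is mono-modal, so a single call per volume suffices.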

B. DATA AUGMENTATION
In our proposed approach, we used random patches from the input images to train the deep learning models. The data were set up by extracting random patches from the volumetric images and the corresponding pixel label data. The patch-based technique is particularly useful on hardware with memory constraints when performing dense training from scratch with arbitrarily large volumetric images. For every pair of MR volume and label during training, we specified a patch size of 4 × 4 pixels, a mini-batch size of 64, and 16 randomly positioned patches per image. Afterwards, a data augmentation technique was applied, which comprised transformations including mirroring and spatial transformation. Hence, for each input, two transformed images were generated. All augmentation operations were applied on-the-fly with our in-house framework. We augmented the training and validation images by applying the transform operations to the random patches. In particular, we used random rotation and reflection of the input data to enable robust training. Further, we cropped the corresponding 2D random patches as required by the network input size.
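Random patch sampling with mirroring, as described above, might look like the following NumPy sketch. It is not the authors' in-house framework: the function names are hypothetical, the flip probability of 0.5 per axis is an assumption, and rotation and elastic deformation are omitted for brevity.

```python
import numpy as np

def sample_patches(image, label, patch_size, n_patches, rng):
    """Draw randomly positioned patches at the same location in image and label."""
    patches = []
    for _ in range(n_patches):
        start = [int(rng.integers(0, dim - p + 1))
                 for dim, p in zip(image.shape, patch_size)]
        region = tuple(slice(s, s + p) for s, p in zip(start, patch_size))
        patches.append((image[region], label[region]))
    return patches

def random_mirror(image, label, rng):
    """Flip image and label together along each axis with probability 0.5."""
    for axis in range(image.ndim):
        if rng.random() < 0.5:
            image = np.flip(image, axis=axis)
            label = np.flip(label, axis=axis)
    return image, label
```

With the settings quoted in the text, a call such as `sample_patches(slice_2d, label_2d, (4, 4), 16, np.random.default_rng(0))` would yield 16 patches of 4 × 4 pixels per image. Applying the same flips to image and label is what keeps the annotation aligned with the augmented input.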

C. DEEP LEARNING MODELS
For segmentation, we employed deep learning-based frameworks. First, we used the proposed modified U-Net architecture shown in Figure 2. We further used a recursive U-Net (Rec-UNet) model [52] to compare and validate the results. The training configuration included setting the hyper-parameters and optimizer settings as well as selecting the loss function. The Rec-UNet model comprises two cascaded stages. In the first stage, a dense-UNet model is used to obtain the initial segmentation results. In the second stage, the segmentation masks from the first stage are used as prior knowledge by incorporating skip-connection blocks in the sub-module of the basic UNet model to obtain more accurate segmentation results. This implementation is recursive, which makes it very easy to configure the number of down-sampling steps. The type of normalization, such as instance normalization, can also be passed as a parameter. A dense Rec-UNet block consists of n consecutive convolution layers at the same resolution, each followed by batch normalization (BN), a rectified linear unit (ReLU), and a dropout layer, together with the cascaded block and transition layers for segmentation in MR images. Each succeeding convolution layer takes the feature maps of all the previous layers as input.
Our proposed modified UNet model follows the original encoder-decoder style UNet model [19], where layers such as convolution, up-convolution, and pooling (kernel size 2 × 2 and a stride of 1) are used. We also incorporated skip connections for better outcomes. The filter sizes and layer configuration are shown in Figure 2.

D. HIPPOCAMPUS SEGMENTATION
We performed test prediction on sequential batches and combined the outputs into a hippocampal segmentation. The output for each test MR image was compared with its corresponding ground truth image. To obtain the final segmentation mask, the mean likelihood for each pixel was computed from the values at the output of the softmax layer. The predicted masks are stored separately for each network, i.e., for the U-Net and the recursive U-Net.
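One way to read the step above is that softmax outputs from overlapping patches are averaged per pixel and the class with the highest mean likelihood is kept. The following NumPy sketch illustrates that reading for 2D patches; the patch layout, the handling of uncovered pixels, and the function name are assumptions rather than details given in the text.

```python
import numpy as np

def fuse_patch_predictions(patch_probs, origins, image_shape, n_classes):
    """Average per-pixel softmax outputs of overlapping patches, then take argmax.

    patch_probs: list of (C, h, w) softmax arrays, one per patch.
    origins: list of (row, col) top-left corners of each patch in the image.
    """
    prob_sum = np.zeros((n_classes,) + tuple(image_shape), dtype=np.float64)
    counts = np.zeros(image_shape, dtype=np.float64)
    for probs, (r, c) in zip(patch_probs, origins):
        h, w = probs.shape[1:]
        prob_sum[:, r:r + h, c:c + w] += probs
        counts[r:r + h, c:c + w] += 1
    # Pixels covered by no patch keep zero probability and fall back to class 0.
    mean_probs = prob_sum / np.maximum(counts, 1)
    return mean_probs.argmax(axis=0)
```

Averaging before the argmax lets a confident prediction from one patch outweigh a hesitant one from its neighbor in the overlap region, which a per-patch argmax followed by voting would not do.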

A. DATASET
We used a total of 263 3D mono-modal MRI volumes of the hippocampus head and body from the publicly available dataset of the Medical Segmentation Decathlon (MSD) competition [53]. These 3D MRIs consist of axial IS (InterSpace) scans with dimensions of 34 × 47 × 40 and an image spacing of 1 mm in each dimension. The structural data were acquired using an MPRAGE T1-weighted sequence with the following acquisition parameters: TI = 860 ms, TR = 8.0 ms, and TE = 3.7 ms, on a Philips Achieva scanner. The MSD challenge was organized by a number of teams and, among several data contributors, the hippocampus data were donated by Vanderbilt University for the conceptual design and metrics committee. The data have been annotated and verified by human experts of the respective fields, as required for precise clinical use.
The segmentation results of our network were compared with the ground truth labels in terms of the following evaluation metrics: mean Dice score, accuracy, IoU index (Jaccard), positive predictive value (PPV), sensitivity, and specificity. They are reported for the background (non-hippocampus region), the anterior hippocampus (left), and the posterior hippocampus (right) in Table 1. These quality metrics are computed as described in [54].
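The six metrics listed above can all be derived from the four confusion-matrix counts of a binary mask. The sketch below uses the standard definitions; the function names are hypothetical, and treating each class one-vs-rest (for example, `pred == 1` against `ref == 1` for one hippocampal label) is an assumption about how the per-class values are obtained.

```python
import numpy as np

def confusion_counts(pred, ref):
    """Return (TP, FP, TN, FN) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    tp = int(np.sum(pred & ref))
    fp = int(np.sum(pred & ~ref))
    tn = int(np.sum(~pred & ~ref))
    fn = int(np.sum(~pred & ref))
    return tp, fp, tn, fn

def segmentation_metrics(tp, fp, tn, fn):
    """Standard overlap and classification metrics from confusion counts."""
    def ratio(num, den):
        return num / den if den else float("nan")   # undefined when denominator is 0
    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "jaccard": ratio(tp, tp + fp + fn),
        "ppv": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }
```

For a single mask, the Dice score D and the Jaccard index J computed this way are linked by J = D / (2 − D), which offers a quick consistency check on reported values.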

B. TRAINING OUR DL MODELS
The network was trained in a sliding-window procedure that predicts the class label of each pixel from a local region (patch) provided around that pixel. In this way the network can localize, and the number of training patches is much larger than the number of training images. There is a trade-off between localization accuracy and the use of context. Larger patches require more max-pooling layers, which reduces the localization accuracy. Smaller patches, on the other hand, allow the network to see only little context. The right balance between localization and context, however, leads to better results when both are used at the same time.
Training from scratch, with up to 20 epochs per network, was performed on a single CPU (Core i7, 6 GB RAM). Test segmentation of an MRI volume (34 × 47 × 40) took about 50 s for the 2D axial images. Table 2 compares recent techniques for hippocampus subfield segmentation with our proposed scheme on the basis of the Dice Similarity Coefficient and the Jaccard Index, for which our model showed considerably better performance. The best results for each metric are highlighted in bold.
In particular, in the nested dilation network (NDN) [15], residual blocks were nested with dilations for segmentation tasks using CT, MRI, and endoscopic images. The data for the hippocampus segmentation task were taken from the MSD, as in our proposed method. Cao et al. [55] employed a 3D U-Net in multitask deep learning for joint hippocampus segmentation and clinical score regression. The authors evaluated their method on 407 subjects with baseline MRI data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). In [56], the authors proposed a framework combining U-Seg-Net and Ensemble-Net, evaluated on 110 healthy subjects from the ADNI. It explored a multi-view ensemble approach that relies on neural networks to combine multiple decision maps for hippocampus segmentation. In [25], segmentation masks were generated using an ensemble of three independent models operating on orthogonal slices of the input volume, while erroneous labels were subsequently corrected by a combination of replace and refine networks. Experiments were performed on a MICCAI dataset, achieving a mean Dice value of 0.88 through transfer learning from the larger EADC-ADNI data. In comparison with these ensemble methods, which are bound to be computationally expensive, our proposed method shows considerably better results for both hippocampus regions.

C. EVALUATION PARAMETERS
In this section, we present the evaluation metrics used to validate each phase of the proposed system. All metrics can be computed either by passing test and reference segmentations as parameters or by passing a confusion matrix object. The latter is useful when many metrics need to be computed, because the relevant computations are only done once. All metrics assume binary segmentation inputs. The confusion matrix returns four integer values: true positives, false positives, true negatives, and false negatives. For hippocampal region localization and detection, we used the greedy overlap criterion between the ground truth box and the predicted box, known as intersection-over-union (IoU) or the Jaccard index. A correctly predicted box is counted as a true positive, otherwise as a false positive. The IoU between the ground truth box and the predicted box is computed as follows:

$$\mathrm{IoU} = \frac{\text{area of overlap}}{\text{area of union}}$$

For performance evaluation of the segmentation phase, we considered the Dice score (Dice), Jaccard coefficient (Jc), pixel-level specificity (SP), pixel-level sensitivity (SE), and pixel-level accuracy (Ac) as the evaluation measures:

$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}$$

$$\mathrm{Jc} = \frac{TP}{TP + FP + FN}$$

$$\mathrm{SP} = \frac{TN}{TN + FP}$$

$$\mathrm{SE} = \frac{TP}{TP + FN}$$

$$\mathrm{Ac} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, TN, FP, and FN represent the number of true positive pixels, true negative pixels, false positive pixels, and false negative pixels, respectively.

We implemented both the basic U-Net and the Recursive U-Net architectures and calculated statistical parameters across all cross-validation folds. The network performed efficiently after being trained on 3D mono-modal axial slices, achieving a mean Dice score of 0.885 ± 0.01, an IoU (Jaccard) of 0.796 ± 0.02, and an accuracy of 0.996 ± 0.01, as shown in Table 1. The training loss reduced to 0.0072 and the validation loss to 0.012 at the final iteration. We implemented and trained the network from scratch on the PyTorch 1.4.0 (Python 3.6) platform. Figure 3 shows the average Dice Similarity Coefficient (DSC) results for the left and right hippocampus regions for the 5th, 10th, 15th, and 20th training samples.
The results show that the mean Dice scores vary both with the number of training subjects and between the left and right hippocampus regions. With an increasing number of training samples, the DSC scores tend to improve, albeit irregularly, and the DSC variations in the left (anterior) hippocampus region are slightly higher than in the right (posterior) hippocampus. The results indicate that our method, using a two-side hippocampus segmentation strategy, can achieve stable and accurate prediction. The final qualitative segmentation results for the left and right hippocampus regions on randomly chosen input source volumes are shown in Figure 4. The source volumes in (a) are shown as a grid of four (two pairs), each pair containing the left and right regions, along with their corresponding ground truth labels in (b). The results of both implementations, i.e., the basic UNet and the Recursive UNet (Rec-UNet), are shown as argmax segmentations of the input volumes. Across all performance metrics, the results of the proposed modified UNet model tend to be slightly higher than those of the recursive implementation of our model. Noisy labels, appearing as artifacts, were observed in many patients and presented a significant problem during the evaluation of the segmentation. Because of such problematic slices in the dataset, it was difficult for the proposed method to handle these situations adequately. Secondly, the segmentation of low-contrast regions in hippocampal volumetry was more challenging because of the close contact with the surrounding complex tissues of the brain. However, both network models showed considerably better performance than the benchmark algorithms in terms of accuracy, sensitivity, specificity, and Dice score.

D. DISCUSSION
Multi-atlas based methods are robust to anatomical changes. However, the quality of image registration plays a critical role in their overall performance. Moreover, the time consumed during segmentation is directly related to registration, i.e., it increases with the total number of registrations performed. The dictionary-based segmentation strategy relies on image reconstruction, in contrast to atlas-based labeling approaches that rely on comparing image similarities between atlases and target images, but it may take several days to learn very good representative dictionaries and optimal discriminating classifiers offline. Deep neural networks, in turn, provide superior performance compared with conventional machine learning algorithms because of their ability to learn data representations suited to various related tasks [57]. Representation learning is a significant feature of CNNs. Unlike traditional ML approaches, deep learning via CNNs resolves computational problems by applying data representation strategies in much simpler ways [58]. Owing to their lower hardware requirements and higher processing speed, 2D CNNs still prevail in the medical research domain, even when applied to 3D brain image volumes [51]. Recent approaches [59], [60] also contribute 3D CNNs for brain image segmentation. In current neuroimaging studies, where large sample sizes are required, CNN-based algorithms have proved their efficiency and robustness [61].
Herein, we have used a modified version of the U-Net architecture, thereby deploying deep learning for a medical image segmentation task. Among the biggest challenges in segmenting the hippocampus region are its small anatomical size and the variation in shape between the left and right hippocampus regions. Our proposed method performs strongly on all major evaluation parameters (Table 1). Further, the results compare favorably with other methods presented in the literature (Table 2). Hence, we have shown that deep learning can be used successfully for challenging medical image segmentation tasks.

IV. CONCLUSION
In this work, we developed a robust automated segmentation method for hippocampus sub-regions in an MRI-based dataset, using the benchmark algorithm of [19] for the biomedical image segmentation task. We trained the network from scratch on a 3D mono-modal hippocampus dataset with data augmentation, spatial elastic deformation, and a loss function combining cross-entropy with a weighted multi-class Dice loss. The dataset is sufficient for evaluating a deep learning-based approach and led to promising results. We tested our model on six different evaluation metrics and, after a dense training procedure, obtained strong results for all of them. In particular, the segmentation accuracy was very high for both the anterior and posterior regions. Various segmentation approaches have been proposed for brain regions, including tasks such as tissue classification or anatomical structure segmentation of the hippocampus region. In comparison, our proposed method is effective in segmenting the hippocampus region, as is evident from our results. In addition, our proposed modified architecture could be utilized for other biomedical image analysis tasks.